# 🤖 TavilyCrawl Tutorial: Intelligent Web Crawling

> **📚 Part of the course "LangChain - Develop AI Agents with LangChain & LangGraph"**  
> [🎓 Get the full course](https://www.udemy.com/course/langchain/?referralCode=D981B8213164A3EA91AC)

## What is TavilyCrawl?

**TavilyCrawl** is an intelligent web crawler that uses AI to decide which paths to explore during a crawl, and it explores those paths in parallel.

### Key Features:

- **AI-Powered Path Selection**: Uses AI to determine which paths to explore
- **Parallel Processing**: Explores hundreds of paths simultaneously  
- **Advanced Extraction**: Extracts content from dynamically rendered pages
- **Instruction-Driven**: Follows natural language instructions to guide exploration
- **Targeted Content**: Returns content tailored for LLM integration and RAG systems

In this tutorial, we'll demonstrate TavilyCrawl on the **LangChain documentation** by comparing two approaches:
1. 🔍 **Regular crawl without instructions** - baseline behavior
2. ✅ **Regular crawl with good instructions** - targeted results

---

## 📦 Setup & Installation

First, let's install the required packages and set up our environment.


In [31]:
# Install required packages
%pip install langchain-tavily certifi

# For pretty printing (json ships with the Python standard library and needs no install)
%pip install rich

Note: you may need to restart the kernel to use updated packages.


In [32]:
import os
import ssl
import json
from typing import Any, Dict, List

import certifi
from langchain_tavily import TavilyCrawl
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.json import JSON

# Configure SSL context
ssl_context = ssl.create_default_context(cafile=certifi.where())
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

# Initialize rich console for pretty printing
console = Console()

print("✅ All imports successful!")

✅ All imports successful!


## 🔑 API Key Setup

You'll need a Tavily API key to use TavilyCrawl. Get yours at [tavily.com](https://app.tavily.com/home).

Set it in the `TAVILY_API_KEY` environment variable, or enter it when the cell below prompts for it.


In [33]:
# Set your Tavily API key here
import getpass

# For Google Colab, you can use getpass for secure input
if 'TAVILY_API_KEY' not in os.environ:
    os.environ['TAVILY_API_KEY'] = getpass.getpass('Enter your Tavily API key: ')

# Alternative: Set directly (uncomment and add your key)
# os.environ["TAVILY_API_KEY"] = "your_tavily_api_key_here"

print("✅ API key set successfully!")

✅ API key set successfully!
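
The cell above prints its success message whether or not a usable key was entered. As a minimal sketch, a small check like the one below fails early when the variable is missing or blank. The helper name `get_required_env` is hypothetical and not part of `langchain-tavily`.

```python
import os
from typing import Mapping, Optional


def get_required_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the stripped value of `name`, or raise if it is missing or blank."""
    # Hypothetical helper for illustration; defaults to the real process environment.
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or enter it when prompted.")
    return value
```

Calling `get_required_env("TAVILY_API_KEY")` right before creating the crawler turns a silent misconfiguration into an immediate, readable error.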


## 🚀 Initialize TavilyCrawl

Let's initialize TavilyCrawl and set up our target URL for demonstration.

In [34]:
# Initialize TavilyCrawl
tavily_crawl = TavilyCrawl()

# Target URL: LangChain Documentation
target_url = "https://python.langchain.com/"

console.print(Panel.fit(
    f"🎯 **Target Website**: {target_url}\n🤖 **Crawler**: TavilyCrawl",
    title="Demo Setup",
    border_style="bright_blue"
))

print("TavilyCrawl initialized successfully")

TavilyCrawl initialized successfully
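
Both demos in this tutorial pass a plain dictionary to `tavily_crawl.invoke(...)`, using the keys `url`, `instructions`, `max_depth`, and `extract_depth`. The sketch below builds that dictionary in one place so the difference between runs is easy to see. `build_crawl_request` is a hypothetical helper, and its defaults are assumptions chosen to mirror the baseline demo.

```python
from typing import Any, Dict, Optional


def build_crawl_request(
    url: str,
    instructions: Optional[str] = None,
    max_depth: int = 1,
    extract_depth: str = "advanced",
) -> Dict[str, Any]:
    """Build the input dictionary for a crawl (hypothetical helper)."""
    request: Dict[str, Any] = {
        "url": url,
        "max_depth": max_depth,
        "extract_depth": extract_depth,
    }
    # Only include instructions when they are provided, as in the baseline demo.
    if instructions:
        request["instructions"] = instructions
    return request


baseline = build_crawl_request("https://python.langchain.com/")
guided = build_crawl_request(
    "https://python.langchain.com/",
    instructions="Find all pages about AI agents",
    max_depth=2,
)
```

Either dictionary can then be passed to `tavily_crawl.invoke(...)` in place of the inline literals used in the demos.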


## 🔍 Demo 1: Regular Crawl Without Instructions

First, let's see what happens when we use TavilyCrawl without any specific instructions. This will show us the baseline behavior on the LangChain documentation.

In [None]:
# Demo 1: Crawl without instructions
console.print(Panel.fit(
    f"🎯 **Target**: {target_url}\n📋 **Instructions**: None (baseline crawl)\n⚙️ **Max Depth**: 1\n🎨 **Extract Depth**: advanced",
    title="Demo 1: Regular Crawl Without Instructions",
    border_style="blue"
))

console.print("Running TavilyCrawl without instructions...", style="blue")

# Basic crawl without instructions
basic_result = tavily_crawl.invoke({
    "url": target_url,
    "max_depth": 1,
    "extract_depth": "advanced"
})

# Show raw output immediately
console.print(basic_result)

# Extract results for analysis
basic_results = basic_result.get("results", [])



# Now display the formatted results nicely


In [None]:
console.print(f"\n📊 **Results Without Instructions**: {len(basic_results)} pages", style="cyan")
console.print("   📄 Mix of all content types from LangChain docs")
console.print("   🔍 No filtering - everything from the crawled sections")
console.print("   ⚠️  Requires manual work to find specific content")

console.print("\n📋 **Sample Results from Basic Crawl (No Filtering):**\n", style="cyan")

for i, result in enumerate(basic_results[:3], 1):  # Show first 3 results
    url = result.get("url", "No URL")
    # Fall back to a placeholder when raw_content is missing or None
    content = (result.get("raw_content") or "No content")[:150] + "..."
    
    panel_content = f"""🔗 **URL**: {url}

📖 **Content Preview**:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"📄 {i}. {url}",
        border_style="blue"
    ))
    print()

console.print(f"... and {max(len(basic_results) - 3, 0)} more mixed results", style="italic cyan")
console.print("🔍 **Note**: Mixed content types - guides, integrations, concepts, etc.", style="cyan")
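
To make the "mixed content types" observation concrete, one option is to group the crawled URLs by their leading path segments. The sketch below assumes each result is a dictionary with a `url` key, as in the cells above. `count_by_section` is a hypothetical helper, and the sample URLs are made up for illustration rather than taken from a real crawl.

```python
from collections import Counter
from typing import Any, Dict, List
from urllib.parse import urlparse


def count_by_section(results: List[Dict[str, Any]], depth: int = 2) -> Counter:
    """Count results by the first `depth` path segments of each URL."""
    counts: Counter = Counter()
    for result in results:
        path = urlparse(result.get("url") or "").path
        segments = [segment for segment in path.split("/") if segment]
        # Pages at the site root are grouped under "/".
        section = "/" + "/".join(segments[:depth]) if segments else "/"
        counts[section] += 1
    return counts


# Made-up sample data in the same shape as `basic_results`.
sample_results = [
    {"url": "https://example.com/docs/concepts/agents/"},
    {"url": "https://example.com/docs/concepts/tools/"},
    {"url": "https://example.com/docs/integrations/providers/"},
    {"url": "https://example.com/"},
]

for section, count in count_by_section(sample_results).most_common():
    print(f"{count:>3}  {section}")
```

Running `count_by_section(basic_results)` on a real crawl shows at a glance which parts of the site dominate an uninstructed run.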










## ✅ Demo 2: Regular Crawl With Good Instructions

Now let's see how good instructions can dramatically improve the quality and relevance of our crawl results. We'll use specific, action-oriented instructions to target exactly what we're looking for.


In [None]:
good_instructions = "Find all pages about AI agents"

console.print(Panel.fit(
    f"🎯 **Target**: {target_url} (same as Demo 1)\n📋 **Instructions**: {good_instructions}\n✅ **Type**: Good (specific, action-oriented)\n⚙️ **Max Depth**: 2\n🎨 **Extract Depth**: advanced",
    title="Demo 2: Regular Crawl With Good Instructions", 
    border_style="green"
))

console.print("Starting crawl with good instructions...", style="green")
console.print("Instructions will guide the AI to target specific content", style="italic")

In [None]:
# Execute the crawl with good instructions
good_result = tavily_crawl.invoke({
    "url": target_url,
    "instructions": good_instructions,
    "max_depth": 2,
    "extract_depth": "advanced"
})

# Show raw output immediately
console.print("\n🔍 **Raw TavilyCrawl Output:**", style="yellow")
result_json = JSON.from_data(good_result)
console.print(result_json)

console.print("\nCrawl with good instructions completed", style="green")

# Show the results of instruction-based filtering
good_results = good_result.get("results", [])

In [None]:
# Display the targeted agent documentation found
console.print("\n🎯 **LangChain Agent Documentation Found:**\n", style="green")

for i, result in enumerate(good_results, 1):
    url = result.get("url", "No URL")
    # Fall back to a placeholder when raw_content is missing or None
    content = (result.get("raw_content") or "No content")[:200] + "..."
    
    panel_content = f"""🔗 **URL**: {url}

📖 **Content Preview**:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"📑 {i}. {url}",
        border_style="green"
    ))
    print()

console.print("📝 **Note**: Results should focus on agents in LangChain; spot-check the URLs above to confirm", style="green")
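
One rough way to check how well the instructions worked is to measure how many returned pages mention the topic at all. The sketch below assumes results carry `url` and `raw_content` keys, as in the cells above. `keyword_hit_rate` is a hypothetical helper, and a substring match is only a crude proxy for relevance. The sample data is made up.

```python
from typing import Any, Dict, List


def keyword_hit_rate(results: List[Dict[str, Any]], keyword: str) -> float:
    """Return the fraction of results whose URL or raw content mentions `keyword`."""
    if not results:
        return 0.0
    needle = keyword.lower()
    hits = 0
    for result in results:
        # Treat missing or None fields as empty strings.
        text = f"{result.get('url') or ''} {result.get('raw_content') or ''}".lower()
        if needle in text:
            hits += 1
    return hits / len(results)


# Made-up sample data in the same shape as `good_results`.
sample_results = [
    {"url": "https://example.com/docs/agents", "raw_content": "Build an agent."},
    {"url": "https://example.com/docs/tools", "raw_content": "Tools used by Agents."},
    {"url": "https://example.com/docs/install", "raw_content": "Installation steps."},
    {"url": "https://example.com/docs/models", "raw_content": None},
]

rate = keyword_hit_rate(sample_results, "agent")
print(f"{rate:.0%} of sample pages mention 'agent'")
```

Comparing `keyword_hit_rate(basic_results, "agent")` with `keyword_hit_rate(good_results, "agent")` gives a number to set beside the page counts.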




## 📊 Comparison of Both Approaches

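Now let's compare both approaches side by side to understand the impact of instructions. Keep in mind that Demo 1 ran with `max_depth` 1 and Demo 2 with `max_depth` 2, so the page counts are not a strictly like-for-like comparison.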

In [None]:
# Create comparison table
comparison_table = Table(title="📊 TavilyCrawl: Instruction Quality Comparison")
comparison_table.add_column("Approach", style="cyan", no_wrap=True)
comparison_table.add_column("Instructions", style="yellow")
comparison_table.add_column("Pages Found", style="blue")
comparison_table.add_column("Content Quality", style="green")
comparison_table.add_column("Usefulness", style="red")

comparison_table.add_row(
    "🔍 No Instructions",
    "None (baseline)",
    f"{len(basic_results)}",
    "Mixed (all types)",
    "Low (requires filtering)"
)



comparison_table.add_row(
    "✅ Good Instructions",
    good_instructions,
    f"{len(good_results)}",
    "Highly targeted",
    "High (ready to use)"
)

console.print(comparison_table)

console.print("\n🎯 **Key Observations:**", style="blue")
console.print("   🔍 **Without instructions**, the crawl returns everything, requiring manual filtering")
console.print("   ✅ **Good instructions** provide highly targeted, ready-to-use results")
console.print("   💡 **Best practice**: Use specific, action-oriented instructions")

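console.print("\n📈 **Efficiency with Good Instructions:**", style="green")
# Guard against an empty baseline crawl to avoid a ZeroDivisionError
if basic_results:
    reduction = (len(basic_results) - len(good_results)) / len(basic_results) * 100
    console.print(f"   🎯 Filtering efficiency: {reduction:.1f}% fewer pages than the baseline crawl")
else:
    console.print("   🎯 Filtering efficiency: n/a (baseline crawl returned no pages)")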
console.print("   ⚡ Time saved: No manual post-processing required")
console.print("   🧠 AI-powered: Intelligent path selection and content filtering")
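
Page counts alone do not show whether the two crawls found the same pages. As a minimal sketch, the comparison below works on URL sets and guards against an empty baseline. `compare_crawls` is a hypothetical helper that assumes each result has a `url` key, and the sample data is made up for illustration.

```python
from typing import Any, Dict, List


def compare_crawls(
    baseline: List[Dict[str, Any]], guided: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Compare two crawl result lists by their unique URLs."""
    baseline_urls = {r.get("url") for r in baseline if r.get("url")}
    guided_urls = {r.get("url") for r in guided if r.get("url")}

    # Percentage change in page count; None when the baseline is empty.
    change_pct = None
    if baseline_urls:
        change_pct = (len(baseline_urls) - len(guided_urls)) / len(baseline_urls) * 100

    return {
        "baseline_pages": len(baseline_urls),
        "guided_pages": len(guided_urls),
        "shared_urls": sorted(baseline_urls & guided_urls),
        "only_in_guided": sorted(guided_urls - baseline_urls),
        "page_count_reduction_pct": change_pct,
    }


# Made-up sample data in the same shape as `basic_results` and `good_results`.
sample_baseline = [
    {"url": "https://example.com/docs/a"},
    {"url": "https://example.com/docs/b"},
    {"url": "https://example.com/docs/c"},
    {"url": "https://example.com/docs/d"},
]
sample_guided = [
    {"url": "https://example.com/docs/a"},
    {"url": "https://example.com/docs/agents"},
]

summary = compare_crawls(sample_baseline, sample_guided)
print(summary)
```

A non-empty `only_in_guided` list is a reminder that a deeper, instructed crawl can surface pages the baseline never reached, so a smaller count is not the same as a filtered subset.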