# 08 - External Index Retrievers 🌐

## Learning Objectives 🎯

In this notebook, you'll learn:

1. **What are External Index Retrievers** and how they differ from vector store retrievers
2. **ArxivRetriever** - Search and retrieve scholarly articles from arxiv.org
3. **WikipediaRetriever** - Access Wikipedia articles for general knowledge
4. **TavilySearchAPIRetriever** - Perform real-time internet searches
5. **Integration with RAG Chains** - Combine external retrievers with LLMs
6. **Best Practices** - When and how to use each retriever effectively

---

## Table of Contents 📚

1. [Introduction to External Retrievers](#intro)
2. [Setup & Installation](#setup)
3. [ArxivRetriever - Academic Papers](#arxiv)
4. [WikipediaRetriever - General Knowledge](#wikipedia)
5. [TavilySearchAPIRetriever - Web Search](#tavily)
6. [Integration with RAG Chains](#rag)
7. [Comparison & Use Cases](#comparison)
8. [Best Practices](#best-practices)
9. [Summary & Exercises](#summary)

---

<a id='intro'></a>
## 1. Introduction to External Index Retrievers 🔍

### What are External Index Retrievers?

**External Index Retrievers** search over external data sources (e.g., the internet, academic databases, knowledge bases) rather than your local vector store.

### Key Differences:

| Feature | Vector Store Retrievers | External Index Retrievers |
|---------|------------------------|---------------------------|
| **Data Source** | Your embedded documents | External databases/APIs |
| **Data Freshness** | Static (at indexing time) | Real-time or regularly updated |
| **Setup Required** | Embedding + Vector store | API keys (sometimes) |
| **Use Cases** | Internal documents, knowledge bases | Current events, academic research, general knowledge |
| **Cost** | Embedding cost + storage | API calls (often free tier available) |

### When to Use External Retrievers:

- ✅ You need **up-to-date information** from the internet
- ✅ You want to access **specialized databases** (e.g., academic papers)
- ✅ You need **general knowledge** without building a custom knowledge base
- ✅ You want to **augment** your local data with external sources

---

<a id='setup'></a>
## 2. Setup & Installation ⚙️

### Required Packages

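All three retrievers ship with `langchain-community`; each also needs its own client package. The RAG examples later in this notebook additionally use `langchain-openai` and `python-dotenv`:

```bash
uv pip install langchain-community
uv pip install arxiv           # For ArxivRetriever
uv pip install wikipedia       # For WikipediaRetriever
uv pip install tavily-python   # For TavilySearchAPIRetriever
uv pip install langchain-openai python-dotenv  # For the RAG chains and .env loading
```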

### Environment Variables

For TavilySearchAPIRetriever, you'll need an API key:

```
TAVILY_API_KEY=your_api_key_here
```

Get your free API key at: https://tavily.com/
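
A quick check before creating the retriever can surface a missing key early instead of failing on the first search. The sketch below uses only the standard library; the helper name `require_env` is hypothetical and not part of LangChain or Tavily.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for this notebook, not a library function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Example usage (commented out so the cell runs even without a key):
# tavily_key = require_env("TAVILY_API_KEY")
```

Call it right after `load_dotenv()` so the error message points at the configuration rather than at the retriever.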

---

In [2]:
# Setup: Import required libraries
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Import LangChain components
from langchain_community.retrievers import ArxivRetriever, WikipediaRetriever, TavilySearchAPIRetriever
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Verify versions
import langchain
print(f"✅ LangChain version: {langchain.__version__}")
print("✅ Setup complete!")

✅ LangChain version: 1.1.0
✅ Setup complete!


<a id='arxiv'></a>
## 3. ArxivRetriever - Academic Papers 📄

### 🔰 BEGINNER: What is ArxivRetriever?

**ArxivRetriever** searches [arxiv.org](https://arxiv.org), a repository of electronic preprints for research papers in:
- Physics
- Mathematics
- Computer Science
- Quantitative Biology
- Quantitative Finance
- Statistics

### Use Cases:
- 📚 Literature review for research
- 🧠 Getting latest research on AI/ML topics
- 📊 Finding papers by specific authors
- 🔬 Accessing cutting-edge research

---

### 🔰 BEGINNER: Basic ArxivRetriever Usage

In [5]:
!uv pip install arxiv

Using Python 3.13.5 environment at: lcrag_venv
Audited 1 package in 13ms


In [6]:
import arxiv

# Create an ArxivRetriever instance
# By default, it returns top 3 documents
arxiv_retriever = ArxivRetriever(
    load_max_docs=3,
    arxiv_search=arxiv.Search,
    arxiv_exceptions=arxiv.ArxivError
)

# Search for papers on "large language models"
query = "large language models"
docs = arxiv_retriever.invoke(query)

print(f"📚 Found {len(docs)} papers on '{query}'\n")

# Display first paper
print("=" * 80)
print(f"Title: {docs[0].metadata.get('Title', 'N/A')}")
print(f"Authors: {docs[0].metadata.get('Authors', 'N/A')}")
print(f"Published: {docs[0].metadata.get('Published', 'N/A')}")
print(f"\nAbstract (first 500 chars):\n{docs[0].page_content[:500]}...")
print("=" * 80)
print(f"Title: {docs[1].metadata.get('Title', 'N/A')}")
print("=" * 80)
print(f"Title: {docs[2].metadata.get('Title', 'N/A')}")


📚 Found 3 papers on 'large language models'

Title: Large Language Models Lack Understanding of Character Composition of Words
Authors: Andrew Shin, Kunitake Kaneko
Published: 2024-07-23

Abstract (first 500 chars):
Large language models (LLMs) have demonstrated remarkable performances on a wide range of natural language tasks. Yet, LLMs' successes have been largely restricted to tasks concerning words, sentences, or documents, and it remains questionable how much they understand the minimal units of text, namely characters. In this paper, we examine contemporary LLMs regarding their ability to understand character composition of words, and show that most of them fail to reliably carry out even the simple t...
Title: Is Self-knowledge and Action Consistent or Not: Investigating Large Language Model's Personality
Title: Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models


### 🎓 INTERMEDIATE: Advanced ArxivRetriever Features

In [7]:
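# Advanced: Retrieve more documents and explore metadata
# Note: the run below still returned 3 papers. In summary mode the result count
# appears to be governed by `top_k_results` (default 3), while `load_max_docs`
# applies when full documents are loaded (`get_full_documents=True`).
arxiv_retriever_advanced = ArxivRetriever(
    load_max_docs=5,  # Intended: top 5 papers (see note above)
    load_all_available_meta=True,  # Load all metadata
    arxiv_search=arxiv.Search,
    arxiv_exceptions=arxiv.ArxivError
)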

# Search for papers on "transformers attention mechanism"
query = "transformers attention mechanism"
docs = arxiv_retriever_advanced.invoke(query)

print(f"📚 Retrieved {len(docs)} papers\n")

# Display metadata for all papers
for i, doc in enumerate(docs, 1):
    print(f"{i}. {doc.metadata.get('Title', 'N/A')}")
    print(f"   Authors: {doc.metadata.get('Authors', 'N/A')}")
    print(f"   Published: {doc.metadata.get('Published', 'N/A')}")
    print(f"   Entry ID: {doc.metadata.get('entry_id', 'N/A')}")
    print()

📚 Retrieved 3 papers

1. Vision Transformer with Quadrangle Attention
   Authors: Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao
   Published: 2023-03-27
   Entry ID: N/A

2. Déjà vu: A Contextualized Temporal Attention Mechanism for Sequential Recommendation
   Authors: Jibang Wu, Renqin Cai, Hongning Wang
   Published: 2020-01-29
   Entry ID: N/A

3. Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
   Authors: Nihal Mehta
   Published: 2025-11-16
   Entry ID: N/A



### 🎓 INTERMEDIATE: Using .batch() for Multiple Queries

In [8]:
# Batch processing: Search multiple topics at once
queries = [
    "RAG retrieval augmented generation",
    "vector embeddings",
    "prompt engineering"
]

arxiv_retriever_batch = ArxivRetriever(
    load_max_docs=3,
    arxiv_search=arxiv.Search,
    arxiv_exceptions=arxiv.ArxivError
)
batch_results = arxiv_retriever_batch.batch(queries)  # One call processes all queries instead of a separate invoke() per query

print("📚 Batch Search Results:\n")
for query, docs in zip(queries, batch_results):
    print(f"Query: '{query}'")
    print(f"  → Found {len(docs)} papers")
    if docs:
        print(f"  → Top result: {docs[0].metadata.get('Title', 'N/A')}")
    print()

📚 Batch Search Results:

Query: 'RAG retrieval augmented generation'
  → Found 3 papers
  → Top result: AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Query: 'vector embeddings'
  → Found 3 papers
  → Top result: Part-of-Speech Relevance Weights for Learning Word Embeddings

Query: 'prompt engineering'
  → Found 3 papers
  → Top result: Towards Goal-oriented Prompt Engineering for Large Language Models: A Survey
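
Conceptually, `.batch()` runs one `invoke` per query and returns the results in input order; LangChain's default implementation does this with a thread pool. The sketch below mimics that behaviour with the standard library only. `FakeRetriever` and `batch_invoke` are hypothetical stand-ins so the example runs without network access; they are not LangChain classes.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeRetriever:
    """Hypothetical stand-in for a retriever; returns canned strings."""

    def invoke(self, query: str) -> list[str]:
        return [f"result for {query!r}"]


def batch_invoke(retriever, queries: list[str], max_workers: int = 4) -> list[list[str]]:
    """Run retriever.invoke for each query in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(retriever.invoke, queries))


results = batch_invoke(FakeRetriever(), ["vector embeddings", "prompt engineering"])
for docs in results:
    print(docs[0])
```

With a real retriever, `retriever.batch(queries, config={"max_concurrency": 2})` caps the number of parallel requests, which can help with rate-limited APIs.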



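### 📊 Understanding ArxivRetriever Metadata

Each document returned by ArxivRetriever carries metadata such as the following. The exact keys depend on the retrieval mode and library version: the `Entry ID: N/A` lines in the output above suggest that, in the default summary mode, the URL is not stored under `entry_id` (it appears to be stored under `'Entry ID'` instead), so inspect `doc.metadata.keys()` before relying on a key.

```python
{
    'Published': '2023-06-15',           # Publication date
    'Title': 'Paper Title',              # Full title
    'Authors': 'Author1, Author2',       # Comma-separated authors
    'Summary': 'Abstract text...',       # Paper abstract/summary (may be absent in summary mode)
    'entry_id': 'http://arxiv.org/...',  # arXiv URL (key name may differ, see note above)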
}
```

The `page_content` field contains the full abstract/summary of the paper.
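
The metadata is enough to build a short citation for each paper. A minimal sketch, assuming the `Title`, `Authors`, and `Published` keys used in the cells above; `format_citation` is a hypothetical helper and the citation style is only illustrative.

```python
def format_citation(metadata: dict) -> str:
    """Build a one-line citation from ArxivRetriever-style metadata.

    Assumes the 'Title', 'Authors', and 'Published' keys shown above;
    missing keys fall back to 'N/A'.
    """
    title = metadata.get("Title", "N/A")
    authors = metadata.get("Authors", "N/A")
    published = str(metadata.get("Published", "N/A"))
    # Use the year if the date starts with four digits, otherwise "n.d."
    year = published[:4] if published[:4].isdigit() else "n.d."
    return f"{authors} ({year}). {title}."


sample = {
    "Title": "Vision Transformer with Quadrangle Attention",
    "Authors": "Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao",
    "Published": "2023-03-27",
}
print(format_citation(sample))
```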

---

<a id='wikipedia'></a>
## 4. WikipediaRetriever - General Knowledge 📖

### 🔰 BEGINNER: What is WikipediaRetriever?

**WikipediaRetriever** searches and retrieves content from Wikipedia, the free encyclopedia with 6+ million articles.

### Use Cases:
- 🌍 General knowledge questions
- 📚 Quick facts and definitions
- 🏛️ Historical information
- 🧑‍🔬 Biographical data
- 🗺️ Geographic information

### Important Notes:
- ⚠️ Wikipedia content is **community-edited** - verify critical information
- ✅ Great for general knowledge, not for specialized or proprietary data
- 🌐 Supports multiple languages

---

### 🔰 BEGINNER: Basic WikipediaRetriever Usage

In [11]:
!uv pip install wikipedia

Resolved 9 packages in 1.21s
Prepared 1 package in 486ms
Installed 1 package in 1ms
 + wikipedia==1.4.0


In [None]:
# Create a WikipediaRetriever instance
# top_k_results defaults to 3; here we request the top 2 articles
wiki_retriever = WikipediaRetriever(top_k_results=2, wiki_client=None)

# Search for information on "Python programming language"
query = "Python programming language"
docs = wiki_retriever.invoke(query)

print(f"📖 Found {len(docs)} Wikipedia articles on '{query}'\n")

# Display first result
print("=" * 80)
print(f"Title: {docs[0].metadata.get('title', 'N/A')}")
print(f"Source: {docs[0].metadata.get('source', 'N/A')}")
print(f"\nContent (first 600 chars):\n{docs[0].page_content[:600]}...")
print("=" * 80)

📖 Found 2 Wikipedia articles on 'Python programming language'

Title: Python (programming language)
Source: https://en.wikipedia.org/wiki/Python_(programming_language)

Content (first 600 chars):
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language. Python 3.0, released in 2008, was a major revision and not completely backward-compatible with earlier versions. Beginning with Python 3.5, capabi...


### 🎓 INTERMEDIATE: Advanced WikipediaRetriever Features

In [9]:
# Advanced: Control number of results and document length
wiki_retriever_advanced = WikipediaRetriever(
    top_k_results=3,        # Get top 3 results
    doc_content_chars_max=1000,  # Limit content to 1000 characters per doc
    wiki_client=None
)

# Search for "Machine Learning"
query = "Machine Learning"
docs = wiki_retriever_advanced.invoke(query)

print(f"📖 Retrieved {len(docs)} Wikipedia articles\n")

# Display all results
for i, doc in enumerate(docs, 1):
    print(f"{i}. Title: {doc.metadata.get('title', 'N/A')}")
    print(f"   Summary: {doc.metadata.get('summary', 'N/A')[:150]}...")
    print(f"   Content length: {len(doc.page_content)} characters")
    print()

📖 Retrieved 3 Wikipedia articles

1. Title: Machine learning
   Summary: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...
   Content length: 1000 characters

2. Title: Neural network (machine learning)
   Summary: In machine learning, a neural network or neural net (NN), also called artificial neural network (ANN), is a computational model inspired by the struct...
   Content length: 1000 characters

3. Title: Attention (machine learning)
   Summary: In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that seq...
   Content length: 1000 characters



### 🎓 INTERMEDIATE: Multilingual Support

In [10]:
# Search in different languages
# Default is English ('en'), but you can specify other languages

# Example: Search in Spanish
wiki_retriever_es = WikipediaRetriever(
    top_k_results=1,
    lang="es",  # Spanish Wikipedia
    wiki_client=None
)

query = "Inteligencia Artificial"
docs = wiki_retriever_es.invoke(query)

print(f"🌐 Search in Spanish Wikipedia: '{query}'\n")
print(f"Title: {docs[0].metadata.get('title', 'N/A')}")
print(f"Content preview:\n{docs[0].page_content[:400]}...")

🌐 Search in Spanish Wikipedia: 'Inteligencia Artificial'

Title: Inteligencia artificial
Content preview:
La inteligencia artificial, abreviado como IA, en el contexto de las ciencias de la computación, es una disciplina y un conjunto de capacidades cognoscitivas e intelectuales expresadas por sistemas informáticos o combinaciones de algoritmos cuyo propósito es la creación de máquinas que imiten la inteligencia humana.
Estas tecnologías permiten que las máquinas aprendan de la experiencia, se adapten...


### 🎓 INTERMEDIATE: Batch Processing with WikipediaRetriever

In [11]:
# Batch search for multiple topics
queries = [
    "Albert Einstein",
    "Quantum Computing",
    "Neural Networks"
]

wiki_retriever_batch = WikipediaRetriever(top_k_results=1, doc_content_chars_max=500, wiki_client=None)
batch_results = wiki_retriever_batch.batch(queries)

print("📖 Batch Wikipedia Search Results:\n")
for query, docs in zip(queries, batch_results):
    print(f"Query: '{query}'")
    if docs:
        print(f"  → Title: {docs[0].metadata.get('title', 'N/A')}")
        print(f"  → Summary: {docs[0].page_content[:200]}...")
    print()

📖 Batch Wikipedia Search Results:

Query: 'Albert Einstein'
  → Title: Albert Einstein
  → Summary: Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity. Einstein also made important contributions to quantum theory...

Query: 'Quantum Computing'
  → Title: Quantum computing
  → Summary: A quantum computer is a (real or theoretical) computer that exploits superposed and entangled states, and the intrinsically non-deterministic outcomes of quantum measurements, as features of its compu...

Query: 'Neural Networks'
  → Title: Neural network (machine learning)
  → Summary: In machine learning, a neural network or neural net (NN), also called artificial neural network (ANN), is a computational model inspired by the structure and functions of biological neural networks.
A...



### 📊 Understanding WikipediaRetriever Metadata

Each document returned by WikipediaRetriever contains:

```python
{
    'title': 'Article Title',           # Wikipedia article title
    'summary': 'Brief summary...',       # Short summary (if available)
    'source': 'https://en.wikipedia...', # Full Wikipedia URL
}
```

The `page_content` field contains the article text (up to `doc_content_chars_max` characters).

---

<a id='tavily'></a>
## 5. TavilySearchAPIRetriever - Web Search 🔍

### 🔰 BEGINNER: What is TavilySearchAPIRetriever?

**TavilySearchAPIRetriever** performs **real-time internet searches** using the Tavily Search API, optimized for AI applications.

### Key Features:
- 🌐 **Real-time web search** - Get the latest information from the internet
- 🎯 **AI-optimized** - Returns clean, relevant content for LLMs
- 🔒 **Source attribution** - Includes URLs and metadata
- ⚡ **Fast & reliable** - Built specifically for AI use cases

### Use Cases:
- 📰 Current events and news
- 💹 Stock prices and market data
- 🌦️ Weather information
- 🏢 Company information
- 🔧 Technical documentation and tutorials

### Getting Started:
1. Sign up at https://tavily.com/ (free tier available)
2. Get your API key
3. Add to `.env` file: `TAVILY_API_KEY=your_api_key`

---

### 🔰 BEGINNER: Basic TavilySearchAPIRetriever Usage

In [12]:
!uv pip install tavily-python

Using Python 3.13.5 environment at: lcrag_venv
Audited 1 package in 15ms


In [9]:
# Create a TavilySearchAPIRetriever instance
# Make sure TAVILY_API_KEY is set in your .env file

tavily_retriever = TavilySearchAPIRetriever(k=3)  # Return top 3 results

# Search for "latest developments in artificial intelligence 2024"
query = "latest developments in artificial intelligence 2024"
docs = tavily_retriever.invoke(query)

print(f"🔍 Found {len(docs)} web results for '{query}'\n")

# Display first result
print("=" * 80)
print(f"Source: {docs[0].metadata.get('source', 'N/A')}")
print(f"\nContent (first 500 chars):\n{docs[0].page_content[:500]}...")
print("=" * 80)

🔍 Found 3 web results for 'latest developments in artificial intelligence 2024'

Source: https://www.launchconsulting.com/posts/the-future-of-business-ai-innovations-to-watch-in-2024

Content (first 500 chars):
AI Trends in 2024 · 1. Generative AI: Beyond Chatbots · 2. The Emergence of Small Language Models · 3. Multi-Modal AI Experiences · 4. AI Empowerment for...


### 🎓 INTERMEDIATE: Advanced TavilySearchAPIRetriever Features

In [20]:
# Advanced: Control search depth and domain filtering
import os

from dotenv import load_dotenv
from langchain_community.retrievers import TavilySearchAPIRetriever

# Load environment variables (including TAVILY_API_KEY) from the .env file
load_dotenv()

# For quick local testing only, you can set the key directly by uncommenting the
# line below. Never commit a real key to a notebook or repository.
# os.environ["TAVILY_API_KEY"] = "your_api_key_here"

# TavilyClient is not needed; TavilySearchAPIRetriever reads the API key from the environment.

# Advanced configuration
tavily_retriever_advanced = TavilySearchAPIRetriever(
    k=5,  # Return top 5 results
    # search_depth="advanced",  # "basic" or "advanced" (more thorough)
    # include_domains=["github.com", "stackoverflow.com"],  # Filter to specific domains
    # exclude_domains=["example.com"]  # Exclude specific domains
)

# Search for "LangChain tutorials"
query = "LangChain tutorials"
docs = tavily_retriever_advanced.invoke(query)
print(f"🔍 Retrieved {len(docs)} web results\n")

# Display all results with sources
for i, doc in enumerate(docs, 1):
    print(f"{i}. Source: {doc.metadata.get('source', 'N/A')}")
    print(f"   Content preview: {doc.page_content[:200]}...")
    print()

🔍 Retrieved 5 web results

1. Source: https://www.youtube.com/watch?v=yF9kGESAi3M
   Content preview: LangChain Master Class For Beginners 2024 [+20 Examples, LangChain V0.2]
aiwithbrandon
84100 subscribers
12977 likes
689864 views
22 Jun 2024
🚀 Pre-order Shipkit.ai - AI dev toolkit for AI-driven deve...

2. Source: https://www.datacamp.com/tutorial/how-to-build-llm-applications-with-langchain
   Content preview: Explore the untapped potential of Large Language Models with LangChain, an open-source Python framework for building advanced AI applications. Here, we explore LangChain - An open-source Python framew...

3. Source: https://github.com/gkamradt/langchain-tutorials
   Content preview: 2. LangChain CookBook Part 2: 9 Use Cases - Code, Video | Kor | Eugene Yurtsev | 🐒 Intermediate | ✅ Code | This is a half-baked prototype that “helps” you extract structured data from text using large...

4. Source: https://langchain-5e9cc07a.mintlify.app/oss/python/learn
   Content 

### 🎓 INTERMEDIATE: Real-Time Information Retrieval

In [12]:
# Example: Get current information (news, weather, stock prices, etc.)
from datetime import datetime

current_date = datetime.now().strftime("%B %d, %Y")

# Real-time queries
queries = [
    f"latest AI news {current_date}",
    "current weather in San Francisco",
    "NVIDIA stock price today"
]

tavily_realtime = TavilySearchAPIRetriever(k=2)

print(f"🕐 Real-Time Information (as of {current_date}):\n")

for query in queries:
    docs = tavily_realtime.invoke(query)
    print(f"Query: '{query}'")
    if docs:
        print(f"  → {docs[0].page_content[:250]}...")
        print(f"  → Source: {docs[0].metadata.get('source', 'N/A')}")
    print()

🕐 Real-Time Information (as of November 23, 2025):

Query: 'latest AI news November 23, 2025'
  → Catch up on select AI news and developments from the past week or so: Google launches Gemini 3 and bakes it into search from Day One....
  → Source: https://www.marketingprofs.com/opinions/2025/54030/ai-update-november-21-2025-ai-news-and-views-from-the-past-week

Query: 'current weather in San Francisco'
  → {'location': {'name': 'San Francisco', 'region': 'California', 'country': 'United States of America', 'lat': 37.775, 'lon': -122.4183, 'tz_id': 'America/Los_Angeles', 'localtime_epoch': 1763878604, 'localtime': '2025-11-22 22:16'}, 'current': {'last_...
  → Source: https://www.weatherapi.com/

Query: 'NVIDIA stock price today'
  → The NVIDIA Corporation Common Stock (NVDA) stock price today is $180.03, reflecting a -0.49% move since the market opened. The company's market capitalization...
  → Source: https://www.kraken.com/stocks/nvda



### 📊 Understanding TavilySearchAPIRetriever Metadata

Each document returned by TavilySearchAPIRetriever contains:

```python
{
    'source': 'https://example.com/...',  # Source URL
    'score': 0.95,                         # Relevance score (0-1)
    'title': 'Page Title',                 # Web page title (if available)
}
```

The `page_content` field contains the extracted text content from the web page.
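
The `score` and `source` fields can be used to drop weak matches and to label each snippet with its URL before the text reaches an LLM. A minimal sketch under stated assumptions: `Doc` is a hypothetical stand-in that mirrors only the `page_content` and `metadata` attributes of a LangChain `Document`, and the 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in mirroring Document's page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs, min_score: float = 0.0) -> str:
    """Keep docs at or above min_score and label each with its source URL."""
    kept = [d for d in docs if d.metadata.get("score", 0.0) >= min_score]
    return "\n\n".join(
        f"[{i}] {d.page_content}\nSource: {d.metadata.get('source', 'N/A')}"
        for i, d in enumerate(kept, 1)
    )


docs = [
    Doc("High-relevance text", {"source": "https://example.com/a", "score": 0.95}),
    Doc("Low-relevance text", {"source": "https://example.com/b", "score": 0.20}),
]
print(format_docs_with_sources(docs, min_score=0.5))
```

Because the function only reads `page_content` and `metadata`, it should also accept the documents returned by the retriever itself.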

---

<a id='rag'></a>
## 6. Integration with RAG Chains 🔗

Now let's combine external retrievers with LLMs to build powerful **Retrieval-Augmented Generation (RAG)** systems!

### 🔰 BEGINNER: Simple QA Chain with External Retriever

In [23]:
# Build a simple RAG chain using WikipediaRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize components
wiki_retriever = WikipediaRetriever(top_k_results=2, doc_content_chars_max=2000, wiki_client=None)
# gpt-5-nano was not available on this account, so gpt-4o-mini is used instead
# llm = ChatOpenAI(model="gpt-5-nano", temperature=0)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create prompt template
# {context} is filled with the text returned by the external retriever
template = """Answer the question based on the following context from Wikipedia:

Context:
{context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain using LCEL (LangChain Expression Language)
rag_chain = (
    {"context": wiki_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
question = "What is quantum computing and how does it work?"
answer = rag_chain.invoke(question)

print(f"Question: {question}\n")
print(f"Answer: {answer}")

Question: What is quantum computing and how does it work?

Answer: Quantum computing is a type of computing that utilizes the principles of quantum mechanics, specifically superposition and entanglement, to process information. Unlike classical computers, which operate on deterministic rules and use bits as the basic unit of information (where each bit can be either 0 or 1), quantum computers use qubits. A qubit can exist in a superposition of both states simultaneously, allowing quantum computers to explore a vast number of possibilities at once.

The operation of a quantum computer involves manipulating qubits in such a way that their quantum states can interfere with one another. This interference can amplify the probability of obtaining the desired measurement result when the qubits are measured. The design of quantum algorithms focuses on creating procedures that leverage this amplification effect to perform calculations more efficiently than classical algorithms.

Quantum compute

### 🎓 INTERMEDIATE: Multi-Source RAG Chain

In [25]:
# Advanced: Combine multiple retrievers for comprehensive answers
from langchain_core.runnables import RunnableParallel

# Initialize multiple retrievers
import arxiv

arxiv_retriever = ArxivRetriever(
    load_max_docs=2,
    arxiv_search=arxiv.Search,
    arxiv_exceptions=arxiv.ArxivError
)
wiki_retriever = WikipediaRetriever(top_k_results=2, doc_content_chars_max=1500, wiki_client=None)

# Function to combine results from multiple retrievers
def multi_retriever(query):
    """Retrieve from multiple sources and combine results."""
    arxiv_docs = arxiv_retriever.invoke(query)
    wiki_docs = wiki_retriever.invoke(query)
    
    # Combine and format
    all_docs = []
    
    if arxiv_docs:
        all_docs.append("=== Academic Papers (ArXiv) ===")
        all_docs.extend([doc.page_content[:500] for doc in arxiv_docs])
    
    if wiki_docs:
        all_docs.append("\n=== General Knowledge (Wikipedia) ===")
        all_docs.extend([doc.page_content[:500] for doc in wiki_docs])
    
    return "\n\n".join(all_docs)  # Combined context from both sources

# Create multi-source RAG chain
multi_source_template = """Answer the question using information from multiple sources below:

{context}

Question: {question}

Provide a comprehensive answer that synthesizes information from both academic and general sources:"""

multi_prompt = ChatPromptTemplate.from_template(multi_source_template)

multi_rag_chain = (
    {"context": multi_retriever, "question": RunnablePassthrough()}
    | multi_prompt
    | llm
    | StrOutputParser()
)

# Ask a question
question = "What are transformers in machine learning?"
answer = multi_rag_chain.invoke(question)

print(f"Question: {question}\n")
print(f"Answer (from multiple sources):\n{answer}")

Question: What are transformers in machine learning?

Answer (from multiple sources):
Transformers are a type of artificial neural network architecture that has revolutionized the field of machine learning, particularly in natural language processing (NLP) and other domains requiring sequential data analysis. Introduced in the landmark paper "Attention Is All You Need" by researchers at Google in 2017, transformers utilize a mechanism known as multi-head attention to process input data more effectively than previous architectures, such as recurrent neural networks (RNNs).

At the core of the transformer architecture is the concept of attention, which allows the model to weigh the importance of different tokens (words or elements) in a sequence when making predictions. This is achieved through a process where each token is converted into a numerical representation called a vector, using a word embedding table. The multi-head attention mechanism enables the model to consider multiple asp

### 🎓 INTERMEDIATE: Real-Time RAG with TavilySearchAPIRetriever

In [26]:
# Build a RAG chain that uses real-time web search
tavily_retriever = TavilySearchAPIRetriever(k=3)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create prompt for real-time information
realtime_template = """Based on the latest information from the web:

{context}

Question: {question}

Provide an up-to-date answer of at least 500 words with source attribution:"""

realtime_prompt = ChatPromptTemplate.from_template(realtime_template)

# Build real-time RAG chain
realtime_rag_chain = (
    {"context": tavily_retriever | format_docs, "question": RunnablePassthrough()}
    | realtime_prompt
    | llm
    | StrOutputParser()
)

# Ask a current events question
question = "What are the latest developments in AI regulation?"
answer = realtime_rag_chain.invoke(question)

print(f"Question: {question}\n")
print(f"Answer (from real-time web search):\n{answer}")

Question: What are the latest developments in AI regulation?

Answer (from real-time web search):
As of the latest information available, there have been significant developments in AI regulation globally, with a focus on ensuring ethical and responsible use of artificial intelligence technologies. One of the most notable developments is the regulation passed by the European Union, which will come into effect on August 1, 2024. This regulation aims to establish harmonized rules for AI across all 27 EU member states, with the goal of promoting trust and transparency in AI systems.

The EU regulation on AI is expected to address various aspects of AI development and deployment, including data privacy, algorithmic transparency, and accountability. It will set out clear guidelines for the use of AI in different sectors, such as healthcare, finance, and transportation, to ensure that AI systems are developed and used in a way that respects fundamental rights and values.

In addition to the 

---

<a id='comparison'></a>
## 7. Comparison & Use Cases 📊

### Retriever Comparison Table

| Feature | ArxivRetriever | WikipediaRetriever | TavilySearchAPIRetriever |
|---------|----------------|-------------------|-------------------------|
| **Data Source** | Academic papers (arxiv.org) | Wikipedia articles | Real-time web search |
| **API Key Required** | ❌ No | ❌ No | ✅ Yes (free tier) |
| **Data Freshness** | Recent research | Regularly updated | Real-time |
| **Best For** | Academic research, ML papers | General knowledge, definitions | Current events, news |
| **Content Type** | Research papers, abstracts | Encyclopedia articles | Web pages, news |
| **Default Results** | 3 papers | 3 articles | 5 results |
| **Multilingual** | ❌ No | ✅ Yes (300+ languages) | ✅ Yes |
| **Metadata** | Title, Authors, Published date | Title, Summary, URL | Source URL, Score |
| **Rate Limits** | Moderate | Moderate | API-dependent |
| **Cost** | 🆓 Free | 🆓 Free | 🆓 Free tier + paid |

---

### When to Use Each Retriever

#### ✅ Use **ArxivRetriever** when:
- You need academic preprints and research papers (note that arXiv preprints are not necessarily peer-reviewed)
- You're building an AI/ML research assistant
- You want the latest scientific papers
- You need citations and author information

#### ✅ Use **WikipediaRetriever** when:
- You need general knowledge and definitions
- You want historical or biographical information
- You're building an educational chatbot
- You need multilingual support
- You want broadly reliable, community-edited content (verify critical facts)

#### ✅ Use **TavilySearchAPIRetriever** when:
- You need real-time, up-to-date information
- You're answering current events questions
- You want to search the broader internet
- You need to filter by specific domains
- Your use case requires the latest data

---

### Combining Retrievers (Hybrid Approach)

For the most comprehensive RAG system:

```python
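# Sketch of hybrid retrieval: route each query to the retriever that fits it.
# Assumes the retrievers created earlier and a `query_type` label from your own
# routing logic (for example a keyword check or an LLM classifier).
def route_query(query: str, query_type: str):
    if query_type == "academic":
        return arxiv_retriever.invoke(query)
    elif query_type == "general_knowledge":
        return wiki_retriever.invoke(query)
    elif query_type == "current_events":
        return tavily_retriever.invoke(query)
    else:
        # Unknown type: query every retriever and combine the results
        return (
            arxiv_retriever.invoke(query)
            + wiki_retriever.invoke(query)
            + tavily_retriever.invoke(query)
        )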
```
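
The routing above assumes a `query_type` label is already available. One simple way to produce it is keyword matching. The sketch below is illustrative only: `classify_query` and its keyword sets are hypothetical and not part of LangChain, and a real system would more likely use an LLM or a trained classifier.

```python
# Hypothetical keyword-based classifier; the keyword sets are illustrative assumptions.
ACADEMIC_KEYWORDS = {"paper", "papers", "arxiv", "research", "study", "survey"}
CURRENT_EVENTS_KEYWORDS = {"latest", "today", "news", "recent", "current", "breaking"}
GENERAL_KEYWORDS = {"who", "what", "define", "definition", "history", "biography"}


def classify_query(query: str) -> str:
    """Return 'academic', 'current_events', 'general_knowledge', or 'unknown'."""
    words = set(query.lower().replace("?", " ").split())
    # Order matters: academic and time-sensitive cues win over generic question words
    if words & ACADEMIC_KEYWORDS:
        return "academic"
    if words & CURRENT_EVENTS_KEYWORDS:
        return "current_events"
    if words & GENERAL_KEYWORDS:
        return "general_knowledge"
    return "unknown"  # the caller can fall back to querying every retriever


print(classify_query("Find papers on attention mechanisms"))
print(classify_query("What is the latest AI news today?"))
print(classify_query("Who was Alan Turing?"))
print(classify_query("transformers"))
```

Returning `"unknown"` instead of guessing keeps the fallback branch of the router in charge of ambiguous queries.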

---

<a id='best-practices'></a>
## 8. Best Practices 💡

### General Best Practices

#### 1. **Handle Errors Gracefully**

```python
try:
    docs = retriever.invoke(query)
except Exception as e:
    print(f"Error retrieving documents: {e}")
    docs = []  # Fallback to empty list
```
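
External APIs often fail transiently (timeouts, rate limits), so retrying before falling back to an empty list can help. The following is a minimal sketch, assuming any callable that takes a query string and returns a list. `retrieve_with_retry` and `flaky_retrieve` are hypothetical names used for illustration, and the demo uses a stand-in function rather than a real retriever.

```python
import time
from typing import Callable, List


def retrieve_with_retry(
    retrieve: Callable[[str], List],
    query: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> List:
    """Call retrieve(query), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return retrieve(query)
        except Exception as exc:  # narrow this to the errors your retriever raises
            if attempt == max_attempts:
                print(f"Giving up after {attempt} attempts: {exc}")
                return []  # same empty-list fallback as above
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)
    return []


# Demo with a stand-in function that fails twice before succeeding
attempts = {"count": 0}


def flaky_retrieve(query: str) -> List[str]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary network error")
    return [f"result for {query}"]


docs = retrieve_with_retry(flaky_retrieve, "quantum computing", base_delay=0.01)
print(docs)
```

With a real retriever, the call would look like `retrieve_with_retry(retriever.invoke, query)`.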

#### 2. **Set Appropriate Limits**

```python
from langchain_community.retrievers import ArxivRetriever

# Keep the number of retrieved documents small (cost, latency, context size)
arxiv_retriever = ArxivRetriever(load_max_docs=3)  # ✅ Good
# arxiv_retriever = ArxivRetriever(load_max_docs=100)  # ❌ Too many
```

#### 3. **Cache Results for Repeated Queries**

```python
# Use a simple cache to avoid redundant API calls
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_search(query: str):
    # Return a tuple so callers cannot mutate the cached result
    return tuple(retriever.invoke(query))
```
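
Note that `lru_cache` never expires its entries, which can serve stale results for time-sensitive searches. A time-to-live (TTL) cache is one possible alternative. The sketch below is an assumption-laden illustration: `TTLCache` is a hypothetical helper, not a LangChain class, and the demo uses a stand-in retrieval function.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache retriever results per query and expire them after ttl_seconds."""

    def __init__(self, retrieve: Callable[[str], List], ttl_seconds: float = 300.0):
        self._retrieve = retrieve
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, List]] = {}

    def invoke(self, query: str) -> List:
        now = time.monotonic()
        cached = self._store.get(query)
        if cached is not None and now - cached[0] < self._ttl:
            return list(cached[1])  # copy so callers cannot mutate the cache
        docs = self._retrieve(query)
        self._store[query] = (now, list(docs))
        return docs


# Demo with a stand-in function that records every real call
calls = []


def fake_retrieve(query: str) -> List[str]:
    calls.append(query)
    return [f"result for {query}"]


cache = TTLCache(fake_retrieve, ttl_seconds=60)
cache.invoke("latest AI news")
cache.invoke("latest AI news")  # served from the cache
print(len(calls))  # 1
```

With a real retriever, this would be wrapped as `TTLCache(retriever.invoke, ttl_seconds=300)`; a short TTL suits web search, while a longer one is usually fine for Wikipedia or arXiv.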

#### 4. **Verify Source Attribution**

```python
# Always include sources in your responses
for doc in docs:
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
```
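
Metadata keys can differ between retrievers, so a single `'source'` lookup may return `'N/A'` for some of them. The sketch below tries several keys in turn. The key names (`title`/`source` and `Title`/`Entry ID`) are assumptions about what the retrievers return, so check `doc.metadata` in your own environment. `Doc` is a minimal stand-in for a LangChain `Document`, and `format_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_sources(docs: List[Doc]) -> str:
    """Build a numbered source list, trying several assumed metadata keys."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.metadata
        title = meta.get("title") or meta.get("Title") or "Untitled"
        source = meta.get("source") or meta.get("Entry ID") or "N/A"
        lines.append(f"[{i}] {title} ({source})")
    return "\n".join(lines)


# Placeholder documents; the titles and URLs are made up for the demo
sample_docs = [
    Doc("...", {"title": "Example Article", "source": "https://example.org/article"}),
    Doc("...", {"Title": "Example Paper", "Entry ID": "https://example.org/paper"}),
    Doc("..."),
]
print(format_sources(sample_docs))
```

The resulting string can be appended to the LLM's answer so that every response carries its sources.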

#### 5. **Combine with Vector Store Retrievers**

```python
# Use external retrievers for general knowledge
# Use vector stores for your proprietary data
def hybrid_retrieve(query):
    external_docs = wiki_retriever.invoke(query)
    internal_docs = vector_store.similarity_search(query)
    return external_docs + internal_docs
```
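
Plain concatenation can return duplicates and lets the first retriever crowd out the others. One possible refinement, sketched below, interleaves the result lists, drops duplicates, and caps the total. `merge_documents` is a hypothetical helper, it assumes each document has `page_content` and `metadata` attributes, and the demo uses `SimpleNamespace` objects as stand-ins for real documents.

```python
from types import SimpleNamespace
from typing import Iterable, List


def merge_documents(*doc_lists: Iterable, max_docs: int = 6) -> List:
    """Interleave results from several retrievers, drop duplicates, cap the total."""
    merged, seen = [], set()
    iterators = [iter(docs) for docs in doc_lists]
    while iterators and len(merged) < max_docs:
        for it in list(iterators):
            doc = next(it, None)
            if doc is None:  # this retriever's results are exhausted
                iterators.remove(it)
                continue
            # Identify a document by its source if present, else by its text
            key = doc.metadata.get("source") or doc.page_content
            if key not in seen:
                seen.add(key)
                merged.append(doc)
                if len(merged) == max_docs:
                    break
    return merged


# Demo with stand-in documents; "u1" appears in both lists
external = [
    SimpleNamespace(page_content="a1", metadata={"source": "u1"}),
    SimpleNamespace(page_content="a2", metadata={"source": "u2"}),
]
internal = [
    SimpleNamespace(page_content="b1", metadata={"source": "u1"}),
    SimpleNamespace(page_content="b2", metadata={}),
]
merged = merge_documents(external, internal)
print([d.page_content for d in merged])  # ['a1', 'a2', 'b2']
```

Interleaving means each source contributes to the top of the list, which matters when only the first few documents fit into the prompt.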

---

### Retriever-Specific Best Practices

#### ArxivRetriever:
- ✅ Use specific search terms (e.g., "BERT transformers" vs "AI")
- ✅ Limit results to 3-5 papers for LLM context
- ✅ Extract metadata for citations
- ❌ Don't use for non-academic queries

#### WikipediaRetriever:
- ✅ Use for general knowledge, not specialized topics
- ✅ Set `doc_content_chars_max` to avoid huge documents
- ✅ Verify information for critical use cases
- ❌ Don't rely on Wikipedia for real-time information

#### TavilySearchAPIRetriever:
- ✅ Monitor API usage (rate limits, costs)
- ✅ Use for time-sensitive queries
- ✅ Filter by domain for specific sources
- ❌ Don't use for queries that don't need real-time data

---

### Performance Tips

1. **Use `.batch()` for multiple queries**
   ```python
   # ✅ Efficient: runs the queries concurrently
   results = retriever.batch([q1, q2, q3])
   
   # ❌ Inefficient: runs the queries one after another
   results = [retriever.invoke(q) for q in [q1, q2, q3]]
   ```

2. **Limit document length for LLM context**
   ```python
   # Truncate long documents to fit the LLM context window
   from langchain_core.documents import Document
   
   docs = [Document(page_content=doc.page_content[:2000], metadata=doc.metadata)
           for doc in raw_docs]
   ```

3. **Use async methods for concurrent retrieval** (if supported)
   ```python
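   # Run several queries concurrently (top-level await works in Jupyter;
   # in a script, wrap the call in asyncio.run())
   import asyncio
   results = await asyncio.gather(*(retriever.ainvoke(q) for q in [q1, q2, q3]))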
   ```
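
Tip 2 truncates each document independently. When several retrievers feed one prompt, a shared budget across all documents may be more useful. The sketch below is one way to do that, assuming character counts are an acceptable rough proxy for tokens. `fit_to_budget` is a hypothetical helper that operates on plain strings such as `doc.page_content`.

```python
from typing import List


def fit_to_budget(texts: List[str], max_total_chars: int = 6000) -> List[str]:
    """Keep texts in order, trimming the last one so the total fits the budget."""
    kept, remaining = [], max_total_chars
    for text in texts:
        if remaining <= 0:
            break  # budget used up; drop the remaining texts
        piece = text[:remaining]
        kept.append(piece)
        remaining -= len(piece)
    return kept


chunks = fit_to_budget(["a" * 10, "b" * 10, "c" * 10], max_total_chars=25)
print([len(c) for c in chunks])  # [10, 10, 5]
```

Because texts are kept in order, put the most relevant documents first so that they are the last to be trimmed.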

---

<a id='summary'></a>
## 9. Summary & Exercises 📝

### 🎯 What You Learned

In this notebook, you learned:

✅ **External Index Retrievers** - Search over external data sources (internet, databases)

✅ **ArxivRetriever** - Retrieve academic papers from arxiv.org
   - Use cases: Research, ML papers, citations
   - Methods: `.invoke()`, `.batch()`
   - Metadata: Title, Authors, Published date

✅ **WikipediaRetriever** - Access Wikipedia articles
   - Use cases: General knowledge, definitions, history
   - Features: Multilingual support, customizable length
   - Metadata: Title, Summary, Source URL

✅ **TavilySearchAPIRetriever** - Real-time web search
   - Use cases: Current events, news, real-time data
   - Features: Domain filtering, search depth control
   - Metadata: Source URL, Relevance score

✅ **RAG Integration** - Combined external retrievers with LLMs
   - Built simple QA chains
   - Created multi-source RAG systems
   - Implemented real-time information retrieval

✅ **Best Practices** - Error handling, caching, source attribution

---

### 💪 Practice Exercises

#### Exercise 1: Academic Research Assistant (🔰 Beginner)
Create a RAG chain that:
- Uses `ArxivRetriever` to find papers on "deep learning"
- Extracts the top 3 paper titles and authors
- Summarizes each paper's abstract using an LLM

#### Exercise 2: Wikipedia Fact Checker (🔰 Beginner)
Build a system that:
- Takes a statement as input (e.g., "Python was created in 1991")
- Uses `WikipediaRetriever` to search for relevant articles
- Uses an LLM to verify if the statement is accurate

#### Exercise 3: Multi-Source News Aggregator (🎓 Intermediate)
Create a RAG chain that:
- Uses `TavilySearchAPIRetriever` to get latest AI news
- Uses `WikipediaRetriever` to get background on AI topics
- Combines both sources to provide a comprehensive news summary

#### Exercise 4: Hybrid Retrieval System (🎓 Intermediate)
Build a system that:
- Classifies queries into "academic", "general", or "current_events"
- Routes to the appropriate retriever based on query type
- Returns results from the most relevant source

#### Exercise 5: Multilingual Knowledge Base (🚀 Advanced)
Create a system that:
- Detects the language of the user's query
- Uses `WikipediaRetriever` with the appropriate language setting
- Returns answers in the user's language

---

### 🔗 Next Steps

- **Notebook 09**: Advanced Retrieval Techniques (Hybrid Search, Re-ranking)
- **Notebook 10**: Production RAG Systems (Caching, Monitoring, Scaling)
- **LangChain Documentation**: https://python.langchain.com/docs/integrations/retrievers/

---

### 📚 Additional Resources

- **ArXiv**: https://arxiv.org/
- **Wikipedia API**: https://www.mediawiki.org/wiki/API:Main_page
- **Tavily API**: https://tavily.com/
- **LangChain Retrievers**: https://python.langchain.com/docs/modules/data_connection/retrievers/

---

**Congratulations!** 🎉 You've mastered external index retrievers in LangChain!

You can now build RAG systems that access:
- 📄 Academic research (ArXiv)
- 📖 General knowledge (Wikipedia)
- 🌐 Real-time web data (Tavily)

Keep experimenting and building amazing AI applications! 🚀