# üìÅ Notebook 02: Document Loaders

**LangChain 1.0.5+ | Mixed Level Class**

---

## üéØ Learning Objectives

By the end of this notebook, you will be able to:
1. Load documents from **PDF files** using PyPDFLoader
2. Load structured data from **CSV files**
3. Load JSON data from **API responses** or files
4. Scrape and load content from **web pages** (HTML)
5. Load **text files** and **markdown files**
6. **Batch process** multiple files using DirectoryLoader
7. Understand Document object structure

---

## üìñ Table of Contents

1. [Why Document Loaders?](#why-loaders)
2. [Document Object Structure](#document-structure)
3. [Loading PDF Files](#pdf-loading)
4. [Loading CSV Files](#csv-loading)
5. [Loading JSON Files](#json-loading)
6. [Loading Web Pages (HTML)](#html-loading)
7. [Loading Text and Markdown Files](#text-loading)
8. [Batch Loading with DirectoryLoader](#batch-loading)
9. [Comparison Table](#comparison)
10. [Best Practices](#best-practices)
11. [Summary & Exercises](#summary)

---

<a id="why-loaders"></a>
## 1. Why Document Loaders? ü§î

### üî∞ BEGINNER

**Document Loaders** are tools that help you convert files (PDFs, CSVs, web pages, etc.) into **Document objects** that LangChain can work with.

Think of them as **translators**:
- **Input**: Files in various formats (PDF, CSV, JSON, HTML)
- **Output**: Standardized Document objects with text content and metadata

### Why is this important?

Every RAG application needs to:
1. üì• **Load** data from various sources
2. üîÑ **Convert** it to a standard format
3. üìä **Extract** metadata (source, page number, etc.)
4. üéØ **Prepare** it for embedding and retrieval

Document Loaders handle all of this automatically!

### üéì INTERMEDIATE

All document loaders in LangChain implement the same interface:
- `.load()`: Load all documents at once (returns list[Document])
- `.lazy_load()`: Load documents one at a time (generator, memory efficient)

This consistency makes it easy to switch between different data sources.

In [1]:
# Setup: Import required libraries
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify setup
print("‚úÖ Environment loaded")
print(f"Current directory: {os.getcwd()}")
print(f"Sample data directory exists: {Path('sample_data').exists()}")

‚úÖ Environment loaded
Current directory: d:\AiCode\LcRAGBtCmp\simple-rag-langchain
Sample data directory exists: True


<a id="document-structure"></a>
## 2. Document Object Structure üìÑ

### üî∞ BEGINNER

Every Document has two main parts:
1. **page_content**: The actual text (string)
2. **metadata**: Information about the document (dictionary)

Think of it like a book:
- **page_content** = The story
- **metadata** = The cover information (title, author, page number, etc.)

In [2]:
from langchain_core.documents import Document

# Create a sample document
doc = Document(
    page_content="This is the actual content of the document. It contains the text we want to process.",
    metadata={
        "source": "example.pdf",
        "page": 1,
        "author": "John Doe",
        "date": "2025-01-15"
    }
)

# Inspect the document
print("üìÑ Document Structure:")
print(f"\nType: {type(doc)}")
print(f"\nContent (first 100 chars): {doc.page_content[:100]}...")
print(f"\nMetadata: {doc.metadata}")
print(f"\nSource: {doc.metadata['source']}")
print(f"Page Number: {doc.metadata['page']}")

üìÑ Document Structure:

Type: <class 'langchain_core.documents.base.Document'>

Content (first 100 chars): This is the actual content of the document. It contains the text we want to process....

Metadata: {'source': 'example.pdf', 'page': 1, 'author': 'John Doe', 'date': '2025-01-15'}

Source: example.pdf
Page Number: 1


<a id="pdf-loading"></a>
## 3. Loading PDF Files üìï

### üî∞ BEGINNER

**PyPDFLoader** is used to load PDF files. It:
- Extracts text from each page
- Creates one Document per page
- Automatically adds source and page number to metadata

### Example 1: Loading a Single PDF

In [6]:
from langchain_community.document_loaders import PyPDFLoader

# Load the "Attention is All You Need" paper (if it exists)
pdf_path = "./pdfs/attention.pdf"

if Path(pdf_path).exists():
    print(f"Loading PDF: {pdf_path}")
    print("‚è≥ This may take a moment...\n")
    
    # Create loader
    loader = PyPDFLoader(pdf_path)
    
    # Load all pages
    documents = loader.load()
    
    print(f"‚úÖ Loaded {len(documents)} pages\n")
    
    # Inspect first page
    print("üìÑ First Page:")
    print(f"   Content (first 200 chars): {documents[0].page_content[:200]}...")
    print(f"\n   Metadata: {documents[0].metadata}")
    
    # Inspect last page
    print(f"\nüìÑ Last Page (page {len(documents)}):")
    print(f"   Content (first 200 chars): {documents[-1].page_content[:200]}...")
    
else:
    print(f"‚ùå PDF not found: {pdf_path}")
    print("   Make sure the file exists in the project root")

Loading PDF: ./pdfs/attention.pdf
‚è≥ This may take a moment...

‚úÖ Loaded 15 pages

üìÑ First Page:
   Content (first 200 chars): Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
...

   Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './pdfs/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}

üìÑ Last Page (page 15):
   Content (first 200 chars): Input-Input Layer5
The
Law
will
never
be
perfect
,
but
its
application
should
be
just
-
this
is
what
we
are
missing
,
in
my
opinion
.
<EOS>
<pad>
The
Law
will
never
be
per

### üéì INTERMEDIATE: Lazy Loading for Large PDFs

For very large PDFs, use `.lazy_load()` to process one page at a time:

In [7]:
# Lazy loading example
if Path(pdf_path).exists():
    loader = PyPDFLoader(pdf_path)
    
    print("üîÑ Lazy loading pages (memory efficient):")
    
    # Process first 3 pages only
    for i, page in enumerate(loader.lazy_load()):
        if i >= 5:  # Only process first 3 pages for demo
            break
        
        print(f"\nPage {i+1}:")
        print(f"  Length: {len(page.page_content)} characters")
        print(f"  Preview: {page.page_content[:100]}...")
    
    print("\nüí° Tip: Use lazy_load() for PDFs > 100 pages to save memory")

üîÑ Lazy loading pages (memory efficient):

Page 1:
  Length: 2859 characters
  Preview: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and...

Page 2:
  Length: 4257 characters
  Preview: 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...

Page 3:
  Length: 1826 characters
  Preview: Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture us...

Page 4:
  Length: 2505 characters
  Preview: Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (r...

Page 5:
  Length: 3188 characters
  Preview: output values. These are concatenated and once again projected, resulting in the final values, as
de...

üí° Tip: Use lazy_load() for PDFs > 100 pages to save memory


### Example 2: Loading Multiple PDFs from a Directory

In [9]:
# Load all PDFs from the pdfs/ directory
pdf_directory = "pdfs"

if Path(pdf_directory).exists():
    print(f"üìÇ Loading PDFs from: {pdf_directory}/\n")
    
    all_documents = []
    
    # Find all PDF files
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files:")
    
    for pdf_file in pdf_files:
        print(f"  - {pdf_file.name}")
        
        # Load each PDF
        loader = PyPDFLoader(str(pdf_file))
        docs = loader.load()
        all_documents.extend(docs)
        
        print(f"    ‚úÖ Loaded {len(docs)} pages")
    
    print(f"\nüìä Total: {len(all_documents)} pages from {len(pdf_files)} PDFs")
    
    # Show unique sources
    sources = set(doc.metadata['source'] for doc in all_documents)
    print(f"\nSources:")
    for source in sources:
        print(f"  - {Path(source).name}")
        
else:
    print(f"‚ùå Directory not found: {pdf_directory}")

üìÇ Loading PDFs from: pdfs/

Found 3 PDF files:
  - attention.pdf
    ‚úÖ Loaded 15 pages
  - rag.pdf
    ‚úÖ Loaded 19 pages
  - ragsurvey.pdf
    ‚úÖ Loaded 21 pages

üìä Total: 55 pages from 3 PDFs

Sources:
  - ragsurvey.pdf
  - attention.pdf
  - rag.pdf


<a id="csv-loading"></a>
## 4. Loading CSV Files üìä

### üî∞ BEGINNER

**CSVLoader** converts each row of a CSV file into a separate Document.

**Use cases:**
- Product catalogs
- FAQ databases
- Customer records
- Any tabular data

In [10]:
from langchain_community.document_loaders import CSVLoader

# Load the products CSV
csv_path = "sample_data/products.csv"

if Path(csv_path).exists():
    print(f"Loading CSV: {csv_path}\n")
    
    # Create loader
    loader = CSVLoader(
        file_path=csv_path,
        source_column="product_name"  # Which column to use as source in metadata
    )
    
    # Load all rows
    documents = loader.load()
    
    print(f"‚úÖ Loaded {len(documents)} products\n")
    
    # Inspect first 3 products
    for i, doc in enumerate(documents[:3], 1):
        print(f"{'='*70}")
        print(f"Product {i}:")
        print(f"{'='*70}")
        print(doc.page_content)
        print(f"\nSource: {doc.metadata['source']}")
        print(f"Row: {doc.metadata.get('row', 'N/A')}")
        print()
    
    print(f"... and {len(documents) - 3} more products")
    
else:
    print(f"‚ùå CSV not found: {csv_path}")

Loading CSV: sample_data/products.csv

‚úÖ Loaded 15 products

Product 1:
product_id: 1
product_name: Laptop Pro 15
category: Electronics
description: High-performance laptop with 15-inch display, Intel i7 processor, 16GB RAM, and 512GB SSD. Perfect for professional work and gaming.
price: 1299.99
stock: 45

Source: Laptop Pro 15
Row: 0

Product 2:
product_id: 2
product_name: Wireless Mouse
category: Accessories
description: Ergonomic wireless mouse with 6 programmable buttons, 2400 DPI optical sensor, and long battery life. Compatible with Windows and Mac.
price: 29.99
stock: 150

Source: Wireless Mouse
Row: 1

Product 3:
product_id: 3
product_name: USB-C Hub
category: Accessories
description: 7-in-1 USB-C hub with HDMI, USB 3.0 ports, SD card reader, and USB-C power delivery. Ideal for laptops and tablets.
price: 49.99
stock: 80

Source: USB-C Hub
Row: 2

... and 12 more products


### üéì INTERMEDIATE: Custom CSV Configuration

In [12]:
if Path(csv_path).exists():
    # Advanced CSV loading with custom configuration
    loader = CSVLoader(
        file_path=csv_path,
        csv_args={
            'delimiter': ',',
            'quotechar': '"',
            'fieldnames': None,  # Use first row as headers
        },
        source_column="product_id"  # Use product_id as source
    )
    
    docs = loader.load()
    
    # Show how metadata is different
    print("üìä CSV with custom configuration:\n")
    print(f"First document source: {docs[0].metadata['source']}")
    print(f"Content preview:\n{docs[0].page_content[:200]}...")

üìä CSV with custom configuration:

First document source: 1
Content preview:
product_id: 1
product_name: Laptop Pro 15
category: Electronics
description: High-performance laptop with 15-inch display, Intel i7 processor, 16GB RAM, and 512GB SSD. Perfect for professional work an...


<a id="json-loading"></a>
## 5. Loading JSON Files üîß

### üî∞ BEGINNER

**JSONLoader** extracts data from JSON files using **jq** syntax (a query language for JSON).

**Common use cases:**
- API responses
- Configuration files
- Structured data exports

In [1]:
%pip install jq

Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_community.document_loaders import JSONLoader
from pathlib import Path

# Load the API response JSON
json_path = "./sample_data/api_response.json"

if Path(json_path).exists():
    print(f"Loading JSON: {json_path}\n")
    
    # Create loader
    # jq_schema tells us where to find the content in the JSON
    # .articles[] means: get all items from the 'articles' array
    loader = JSONLoader(
        file_path=json_path,
        jq_schema=".articles[]",  # Extract each article
        text_content=False  # Return full JSON for each article
    )
    
    # Load articles
    documents = loader.load()
    
    print(f"‚úÖ Loaded {len(documents)} articles\n")
    
    # Inspect first article
    print("üì∞ First Article:")
    print(f"Content:\n{documents[0].page_content}\n")
    print(f"Metadata: {documents[0].metadata}")
    
else:
    print(f"‚ùå JSON not found: {json_path}")

Loading JSON: ./sample_data/api_response.json

‚úÖ Loaded 5 articles

üì∞ First Article:
Content:
{"id": "article_001", "title": "Introduction to Retrieval-Augmented Generation (RAG)", "author": "Dr. Sarah Chen", "published_date": "2025-01-10", "category": "Machine Learning", "tags": ["RAG", "LLM", "NLP", "AI"], "summary": "Retrieval-Augmented Generation (RAG) is a powerful technique that combines information retrieval with large language models to generate more accurate and contextual responses.", "content": "RAG systems work by first retrieving relevant documents from a knowledge base, then using those documents as context for a language model to generate responses. This approach significantly reduces hallucinations and provides more factual, grounded outputs. The architecture typically consists of three main components: a document store, an embedding model for semantic search, and a language model for generation.", "reading_time": "5 minutes", "views": 15420, "likes": 892}

Metadat

### üéì INTERMEDIATE: Extracting Specific Fields from JSON

In [5]:
if Path(json_path).exists():
    # Extract only the article content field
    loader = JSONLoader(
        file_path=json_path,
        jq_schema=".articles[].content",  # Get only 'content' field
        text_content=True  # Treat as plain text
    )
    
    docs = loader.load()
    
    print("üìù Extracted Article Contents Only:\n")
    for i, doc in enumerate(docs[:2], 1):
        print(f"{i}. {doc.page_content[:150]}...")
        print()

üìù Extracted Article Contents Only:

1. RAG systems work by first retrieving relevant documents from a knowledge base, then using those documents as context for a language model to generate ...

2. Vector databases like FAISS, Pinecone, and Chroma provide optimized storage and retrieval for embedding vectors. Unlike traditional databases that use...



### üî∞ BEGINNER TIP: Understanding jq Syntax

**jq** is like a GPS for JSON:

| jq Expression | Meaning |
|--------------|----------|
| `.` | Root of JSON |
| `.articles` | Get the 'articles' field |
| `.articles[]` | Get all items in 'articles' array |
| `.articles[0]` | Get first item in 'articles' array |
| `.articles[].title` | Get 'title' from each article |

**Example:**
```json
{
  "articles": [
    {"title": "Article 1", "content": "..."},
    {"title": "Article 2", "content": "..."}
  ]
}
```
- `.articles[]` ‚Üí Returns both articles
- `.articles[].title` ‚Üí Returns ["Article 1", "Article 2"]

<a id="html-loading"></a>
## 6. Loading Web Pages (HTML) üåê

### üî∞ BEGINNER

**WebBaseLoader** scrapes web pages and extracts text content.

**Important:** Only works with **static HTML**. For JavaScript-rendered sites, you'd need Playwright or Selenium.

### Example 1: Loading a Local HTML File

In [6]:
%pip install unstructured

Note: you may need to restart the kernel to use updated packages.


In [8]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

# Load our sample blog post
html_path = "sample_data/blog_post.html"

if Path(html_path).exists():
    print(f"Loading HTML: {html_path}\n")
    
    # For local files, we need to use file:// protocol
    file_url = f"file://{Path(html_path).absolute()}"
    
    # Create loader
    loader = UnstructuredHTMLLoader(html_path)
    
    # Load the page
    documents = loader.load()
    
    print(f"‚úÖ Loaded {len(documents)} document(s)\n")
    
    # Inspect content
    doc = documents[0]
    print(f"üìÑ Content length: {len(doc.page_content)} characters")
    print(f"\nüìù First 500 characters:\n{doc.page_content[:500]}...")
    print(f"\nüîç Metadata: {doc.metadata}")
    
else:
    print(f"‚ùå HTML not found: {html_path}")

Loading HTML: sample_data/blog_post.html

‚úÖ Loaded 1 document(s)

üìÑ Content length: 7197 characters

üìù First 500 characters:
Building Intelligent Applications with RAG

By Dr. Amanda Foster | January 15, 2025 | 12 min read

Introduction

In the rapidly evolving landscape of artificial intelligence, Retrieval-Augmented Generation (RAG) has emerged as a game-changing approach for building intelligent applications. Unlike traditional chatbots that rely solely on the knowledge embedded in their training data, RAG systems combine the power of information retrieval with language generation to produce more accurate, contextu...

üîç Metadata: {'source': 'sample_data/blog_post.html'}


### üéì INTERMEDIATE: Loading Multiple URLs

In [10]:
# Example: Load multiple web pages at once
# NOTE: This will actually make HTTP requests, so we're using examples

%pip install langchain_community

from langchain_community.document_loaders import WebBaseLoader

#Uncomment to try with real websites:
urls = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/expression_language/"
]

loader = WebBaseLoader(urls)
docs = loader.load()

print(f"Loaded {len(docs)} pages")
for doc in docs:
    print(f"  - {doc.metadata['source']}")

print("üí° WebBaseLoader Example:")
print("\nTo load web pages, use:")
# print("""loader = WebBaseLoader([
#     "https://example.com/page1",
#     "https://example.com/page2"
# ])""")
print("\n‚ö†Ô∏è Note: Only works with static HTML (no JavaScript rendering)")

Note: you may need to restart the kernel to use updated packages.


USER_AGENT environment variable not set, consider setting it to identify your requests.


Loaded 2 pages
  - https://python.langchain.com/docs/introduction/
  - https://python.langchain.com/docs/expression_language/
üí° WebBaseLoader Example:

To load web pages, use:

‚ö†Ô∏è Note: Only works with static HTML (no JavaScript rendering)


In [11]:
# Print content from both loaded pages
print("="*80)
print("üìÑ LOADED DOCUMENTS CONTENT")
print("="*80)

for i, doc in enumerate(docs, 1):
    print(f"\n{'='*80}")
    print(f"üìÑ PAGE {i}: {doc.metadata['source']}")
    print(f"{'='*80}")

    # Print first 1000 characters of content
    print(f"\nüìù Content Preview (first 1000 chars):")
    print(doc.page_content[:1000])
    print(f"\n... [Total length: {len(doc.page_content)} characters]")

    # Print metadata
    print(f"\nüîç Metadata:")
    for key, value in doc.metadata.items():
        print(f"   {key}: {value}")

    print("\n")

# Full content of a specific page
print("\n" + "="*80)
print("üìñ FULL CONTENT OF PAGE 1")
print("="*80)
print(docs[0].page_content)

#  Or for a simpler version to just see the content:

# Simple version - print both pages
# for i, doc in enumerate(docs, 1):
#     print(f"\n{'='*80}")
#     print(f"PAGE {i}: {doc.metadata['source']}")
#     print(f"{'='*80}\n")
#     print(doc.page_content)
#     print("\n")

üìÑ LOADED DOCUMENTS CONTENT

üìÑ PAGE 1: https://python.langchain.com/docs/introduction/

üìù Content Preview (first 1000 chars):
LangChain overview - Docs by LangChainSkip to main contentüöÄ Share how you're building agents for a chance to win LangChain swag!Docs by LangChain home pageLangChain + LangGraphSearch...‚åòKAsk AIGitHubTry LangSmithTry LangSmithSearch...NavigationLangChain overviewLangChainLangGraphDeep AgentsIntegrationsLearnReferenceContributePythonOverviewChangelogGet startedInstallQuickstartPhilosophyCore componentsAgentsModelsMessagesToolsShort-term memoryStreamingStructured outputMiddlewareOverviewBuilt-in middlewareCustom middlewareAdvanced usageGuardrailsRuntimeContext engineeringModel Context Protocol (MCP)Human-in-the-loopMulti-agentRetrievalLong-term memoryAgent developmentLangSmith StudioTestAgent Chat UIDeploy with LangSmithDeploymentObservabilityOn this page Install Create an agent Core benefitsLangChain overviewCopy pageCopy pageLangChain v1.x is now ava

<a id="text-loading"></a>
## 7. Loading Text and Markdown Files üìù

### üî∞ BEGINNER

For simple text files, use **TextLoader**.

In [12]:
from langchain_community.document_loaders import TextLoader

# Load the notes.txt file
txt_path = "./sample_data/notes.txt"

if Path(txt_path).exists():
    print(f"Loading text file: {txt_path}\n")
    
    # Create loader
    loader = TextLoader(txt_path, encoding="utf-8")
    
    # Load the file
    documents = loader.load()
    
    print(f"‚úÖ Loaded {len(documents)} document\n")
    
    doc = documents[0]
    print(f"üìÑ Content length: {len(doc.page_content)} characters")
    print(f"\nüìù First 300 characters:\n{doc.page_content[:300]}...")
    print(f"\nüîç Metadata: {doc.metadata}")
    
else:
    print(f"‚ùå Text file not found: {txt_path}")

Loading text file: ./sample_data/notes.txt

‚úÖ Loaded 1 document

üìÑ Content length: 8567 characters

üìù First 300 characters:
LANGCHAIN STUDY NOTES - RAG IMPLEMENTATION

Date: January 15, 2025
Topic: Retrieval-Augmented Generation with LangChain 1.0+


CORE CONCEPTS
-------------

1. Document Object Structure
   - page_content: The actual text content
   - metadata: Dictionary wit...

üîç Metadata: {'source': './sample_data/notes.txt'}


### Markdown Files

For Markdown files, use **UnstructuredMarkdownLoader** (preserves structure):

In [17]:
%pip install "unstructured[all]"

# UnstructuredMarkdownLoader and TextLoader are already imported and available

readme_path = "README.md"

# Check if README.md exists
if Path(readme_path).exists():
    try:
        loader = UnstructuredMarkdownLoader(readme_path)
        docs = loader.load()
        print(f"‚úÖ Loaded {len(docs)} document(s)")
        print(f"\nFirst 200 chars:\n{docs[0].page_content[:200]}...")
    except RuntimeError as e:
        print("‚ö†Ô∏è UnstructuredMarkdownLoader failed with RuntimeError.")
        print("   Falling back to TextLoader.")
        loader = TextLoader(readme_path)
        docs = loader.load()
        print(f"   ‚úÖ Loaded with TextLoader: {len(docs[0].page_content)} chars")
    except Exception as e:
        print(f"‚ö†Ô∏è Unexpected error: {e}")
        print("   Falling back to TextLoader.")
        loader = TextLoader(readme_path)
        docs = loader.load()
        print(f"   ‚úÖ Loaded with TextLoader: {len(docs[0].page_content)} chars")
else:
    print(f"‚ÑπÔ∏è No README.md found in current directory")

Note: you may need to restart the kernel to use updated packages.
‚ö†Ô∏è Unexpected error: No module named 'markdown'
   Falling back to TextLoader.




RuntimeError: Error loading README.md

<a id="batch-loading"></a>
## 8. Batch Loading with DirectoryLoader üìÇ

### üî∞ BEGINNER

**DirectoryLoader** loads all files from a directory automatically.

Perfect for:
- Loading entire document libraries
- Processing multiple files at once
- Building knowledge bases

In [18]:
from langchain_community.document_loaders import DirectoryLoader

# Load all files from sample_data directory
data_dir = "sample_data"

if Path(data_dir).exists():
    print(f"üìÇ Loading all text files from: {data_dir}/\n")
    
    # Create loader for .txt files only
    loader = DirectoryLoader(
        data_dir,
        glob="*.txt",  # Pattern to match files
        loader_cls=TextLoader,  # Use TextLoader for each file
        show_progress=True  # Show progress bar
    )
    
    # Load all matching files
    documents = loader.load()
    
    print(f"\n‚úÖ Loaded {len(documents)} text file(s)\n")
    
    # Show sources
    for doc in documents:
        print(f"  - {doc.metadata['source']} ({len(doc.page_content)} chars)")
        
else:
    print(f"‚ùå Directory not found: {data_dir}")

üìÇ Loading all text files from: sample_data/



100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:00<00:00, 1482.61it/s]


‚úÖ Loaded 1 text file(s)

  - sample_data\notes.txt (8639 chars)





### üéì INTERMEDIATE: Loading Multiple File Types

In [20]:
from langchain_community.document_loaders import CSVLoader

# Advanced: Load all files from a directory (mixed types)
# This function handles different file types intelligently

def load_all_documents(directory: str) -> list:
    """
    Load documents from multiple file formats in a directory.
    
    Supports: PDF, TXT, CSV, JSON, HTML
    """
    all_docs = []
    directory_path = Path(directory)
    
    if not directory_path.exists():
        print(f"‚ùå Directory not found: {directory}")
        return []
    
    print(f"üìÇ Loading from: {directory}\n")
    
    # Load PDFs
    pdf_files = list(directory_path.glob("*.pdf"))
    for pdf in pdf_files:
        loader = PyPDFLoader(str(pdf))
        docs = loader.load()
        all_docs.extend(docs)
        print(f"  ‚úÖ PDF: {pdf.name} ({len(docs)} pages)")
    
    # Load TXT files
    txt_files = list(directory_path.glob("*.txt"))
    for txt in txt_files:
        loader = TextLoader(str(txt))
        docs = loader.load()
        all_docs.extend(docs)
        print(f"  ‚úÖ TXT: {txt.name}")
    
    # Load CSV files
    csv_files = list(directory_path.glob("*.csv"))
    for csv in csv_files:
        loader = CSVLoader(str(csv))
        docs = loader.load()
        all_docs.extend(docs)
        print(f"  ‚úÖ CSV: {csv.name} ({len(docs)} rows)")
    
    # Load JSON files
    json_files = list(directory_path.glob("*.json"))
    for json_file in json_files:
        try:
            loader = JSONLoader(
                str(json_file),
                jq_schema=".",
                text_content=False
            )
            docs = loader.load()
            all_docs.extend(docs)
            print(f"  ‚úÖ JSON: {json_file.name}")
        except Exception as e:
            print(f"  ‚ö†Ô∏è JSON: {json_file.name} (error: {str(e)[:50]}...)")
    
    print(f"\nüìä Total: {len(all_docs)} documents loaded")
    return all_docs

# Test the function
if Path("sample_data").exists():
    all_documents = load_all_documents(".\\sample_data")
    
    # Show summary
    print(f"\nüìà Summary:")
    sources = [doc.metadata['source'] for doc in all_documents]
    print(f"   Files loaded: {len(set(sources))}")
    print(f"   Total documents: {len(all_documents)}")

üìÇ Loading from: .\sample_data

  ‚úÖ TXT: notes.txt
  ‚úÖ CSV: products.csv (15 rows)
  ‚úÖ JSON: api_response.json

üìä Total: 17 documents loaded

üìà Summary:
   Files loaded: 3
   Total documents: 17


<a id="comparison"></a>
## 9. Loader Comparison Table üìä

### üî∞ BEGINNER REFERENCE

| Loader | File Type | Use Case | Documents Created |
|--------|-----------|----------|-------------------|
| **PyPDFLoader** | `.pdf` | Research papers, books, reports | 1 per page |
| **CSVLoader** | `.csv` | Product catalogs, data tables | 1 per row |
| **JSONLoader** | `.json` | API responses, config files | Depends on jq query |
| **WebBaseLoader** | Web URLs | Blog posts, documentation | 1 per URL |
| **TextLoader** | `.txt` | Plain text, logs | 1 per file |
| **UnstructuredMarkdownLoader** | `.md` | Documentation, notes | 1 per file |
| **DirectoryLoader** | Multiple | Batch processing | All files matching pattern |

### When to Use Which?

- üìï **Academic papers?** ‚Üí PyPDFLoader
- üìä **Structured data?** ‚Üí CSVLoader
- üîß **API data?** ‚Üí JSONLoader
- üåê **Web content?** ‚Üí WebBaseLoader
- üìù **Simple text?** ‚Üí TextLoader
- üìÇ **Entire folder?** ‚Üí DirectoryLoader

<a id="best-practices"></a>
## 10. Best Practices üåü

### üî∞ BEGINNER TIPS

#### 1. Always Check File Existence
```python
# ‚úÖ Good
if Path(file_path).exists():
    loader = PyPDFLoader(file_path)
    docs = loader.load()
else:
    print(f"File not found: {file_path}")

# ‚ùå Bad - Will crash if file doesn't exist
loader = PyPDFLoader(file_path)
docs = loader.load()
```

#### 2. Use Lazy Loading for Large Files
```python
# For PDFs > 100 pages or files > 10MB
for page in loader.lazy_load():
    process_page(page)
```

#### 3. Inspect Metadata
```python
# Always check what metadata is available
print(docs[0].metadata)
```

### üéì INTERMEDIATE TIPS

#### 1. Error Handling
```python
try:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
except FileNotFoundError:
    print(f"File not found: {pdf_path}")
except Exception as e:
    print(f"Error loading {pdf_path}: {e}")
```

#### 2. Add Custom Metadata
```python
# Add custom metadata after loading
for doc in documents:
    doc.metadata['loaded_at'] = datetime.now().isoformat()
    doc.metadata['category'] = 'research_paper'
```

#### 3. Filter Documents
```python
# Filter by metadata
research_docs = [
    doc for doc in all_documents 
    if 'research' in doc.metadata['source'].lower()
]
```

<a id="summary"></a>
## 11. Summary & Exercises üìù

### üéâ What You Learned

‚úÖ **Document Loaders** convert files into standardized Document objects

‚úÖ **PyPDFLoader** loads PDF files (1 document per page)

‚úÖ **CSVLoader** loads CSV data (1 document per row)

‚úÖ **JSONLoader** uses jq syntax to extract data from JSON

‚úÖ **WebBaseLoader** scrapes web pages (static HTML only)

‚úÖ **TextLoader** handles plain text files

‚úÖ **DirectoryLoader** batch processes multiple files

‚úÖ All loaders return **Document** objects with `page_content` and `metadata`

### üí° Practice Exercises

#### üî∞ Beginner Exercises

1. **Load a PDF and count pages**
   - Use PyPDFLoader to load `attention.pdf`
   - Print the number of pages
   - Print the first 100 characters of page 1

2. **Load CSV and find products by category**
   - Load `products.csv`
   - Filter documents to find only "Electronics"
   - Print product names

3. **Combine multiple files**
   - Load notes.txt, products.csv, and api_response.json
   - Count total documents
   - Print unique sources

#### üéì Intermediate Exercises

1. **Build a multi-format loader**
   - Create a function that accepts a directory path
   - Automatically detect file types (.pdf, .csv, .json, .txt)
   - Load all files and add custom metadata (file_type, loaded_date)

2. **Extract specific data from JSON**
   - Load `api_response.json`
   - Use jq to extract only article titles
   - Create a summary document with all titles

3. **Lazy load and process**
   - Use lazy_load() on a PDF
   - Process each page and extract pages containing specific keywords
   - Save filtered pages to a new list

### üìö Next Steps

In **Notebook 03: Text Splitting Strategies**, you'll learn how to:
- Split long documents into chunks
- Choose optimal chunk sizes
- Handle overlap for better context
- Use different splitters for different content types

---

**Congratulations! You now know how to load data from any source! üéâ**