## Key Takeaways

### Most Important Loaders:
1. **PyPDFLoader** - PDFs (most common document format)
2. **WebBaseLoader** - Web scraping (static HTML pages; no JavaScript rendering)
3. **DirectoryLoader** - Batch loading multiple files
4. **TextLoader** - Simple text files
5. **CSVLoader** - Structured data

### For Social Media:
- **Twitter**: Requires API credentials (apply at developer.twitter.com)
- **LinkedIn**: No official loader, must use manual data export
- **Alternative**: Use web scraping tools (but respect ToS and robots.txt)

### Tips:
- Always check the metadata returned with documents
- Use text splitters after loading for better RAG performance
- Respect website rate limits and terms of service
- For APIs, always secure your credentials using environment variables
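
The last tip can be sketched in a few lines. This is a minimal illustration, not part of any loader's API: the helper name `get_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions chosen for the example.

```python
import os


def get_bearer_token(env_var: str = "TWITTER_BEARER_TOKEN") -> str:
    """Return the API token stored in an environment variable.

    The variable name is hypothetical; use whatever name your deployment sets.
    """
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

The returned value could then be passed wherever a loader expects a credential, for example the `oauth2_bearer_token` argument in the Twitter loader cell, so the secret never appears in the notebook itself.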

In [None]:
### Complete example: Load and process web content
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the webpage
loader = WebBaseLoader("https://abdullahakram.me/")
docs = loader.load()

# 2a. CONFIGURE the splitter (just setting up rules, NOT splitting yet!)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Maximum size of each chunk
    chunk_overlap=200     # Characters to overlap between chunks
)

# 2b. USE the splitter to actually split the documents (NOW we're splitting!)
chunks = text_splitter.split_documents(docs)

print(f"Original documents: {len(docs)}")
print(f"Split into chunks: {len(chunks)}")
print(f"\nFirst chunk preview:\n{chunks[0].page_content[:300]}...")
print(docs)

Original documents: 1
Split into chunks: 1

First chunk preview:
Abdullah Akram...
[Document(metadata={'source': 'https://abdullahakram.me/', 'title': 'Abdullah Akram', 'language': 'en'}, page_content='\n\n\n\n\n\nAbdullah Akram\n\n\n\n\n\n\n\n\n\n')]


## 11. Practical Example: Complete Web Scraping

Let's do a complete example of scraping a website and processing the data.

### Understanding Text Splitting (2 Steps)

**It's like using a knife to cut bread:**

**Step 1: Choose your knife** 🔪
```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
- You're NOT cutting anything yet!
- You're just choosing WHICH knife to use and HOW it should cut
- `chunk_size=1000` = "Each piece should be at most 1000 characters"
- `chunk_overlap=200` = "Share 200 characters between pieces for context"

**Step 2: Actually cut the bread** 🍞
```python
chunks = text_splitter.split_documents(docs)
```
- NOW you're actually cutting!
- You take your configured splitter and apply it to the documents
- `docs` = the whole document (uncut bread)
- `chunks` = the split pieces (sliced bread)

Think of it as: **Configuration** → **Execution**
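
To see what `chunk_size` and `chunk_overlap` do, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` is implemented (the real splitter prefers to break on separators such as paragraphs, lines, and words); it uses a fixed sliding window, and the function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Notice that `hij` ends the first piece and starts the second: that is the overlap. A text shorter than `chunk_size` comes back as a single piece, which is why the recorded output in the complete example reports "Split into chunks: 1" for a very short page.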

In [2]:
### JSONLoader
from langchain_community.document_loaders import JSONLoader

# Load JSON and extract specific fields
loader = JSONLoader(
    file_path="path/to/data.json",
    jq_schema=".messages[].content",  # Use jq syntax to extract specific fields
    text_content=False
)
docs = loader.load()

print(f"Loaded {len(docs)} items from JSON")

ImportError: jq package not found, please install it with `pip install jq`

## 10. JSON Loader
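
The `jq_schema=".messages[].content"` expression in the JSONLoader example reads as "for every item in `messages`, take its `content` field". A rough plain-Python equivalent, using a made-up sample payload (the field names other than `messages` and `content` are assumptions):

```python
import json

# Hypothetical sample data shaped like the file the JSONLoader example expects
raw = '{"messages": [{"sender": "a", "content": "Hello"}, {"sender": "b", "content": "Hi there"}]}'

data = json.loads(raw)

# Same selection as the jq expression .messages[].content
contents = [message["content"] for message in data["messages"]]
print(contents)
```

With JSONLoader, each selected value would become one document, so this sample would yield two documents.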

In [None]:
### YouTubeLoader - Get video transcripts
# Install: !pip install youtube-transcript-api pytube

# from langchain_community.document_loaders import YoutubeLoader
# 
# loader = YoutubeLoader.from_youtube_url(
#     "https://www.youtube.com/watch?v=VIDEO_ID",
#     add_video_info=True  # Include video metadata
# )
# docs = loader.load()
# 
# print(f"Video transcript: {docs[0].page_content[:300]}...")

print("YouTube loader requires: pip install youtube-transcript-api pytube")

## 9. YouTube Transcripts

Load transcripts from YouTube videos.

In [9]:
### UnstructuredWordDocumentLoader
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader("path/to/document.docx")
docs = loader.load()

print(f"Content preview: {docs[0].page_content[:200]}...")

ImportError: unstructured package not found, please install it with `pip install unstructured`

## 8. Word Documents

In [None]:
### CSVLoader
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("path/to/data.csv")
docs = loader.load()

# Each row becomes a document
print(f"Loaded {len(docs)} rows from CSV")
print(f"First row: {docs[0].page_content[:200]}...")

## 7. CSV Loader

Great for loading structured data.
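
To make "each row becomes a document" concrete, here is a rough standard-library sketch of the idea. It only approximates what CSVLoader produces: the `column: value` layout and the `source`/`row` metadata keys mirror the loader's usual output as far as I know, but treat them as assumptions and check `docs[0]` in your own run. The function name `rows_to_documents` is hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str = "data.csv") -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


sample = "name,role\nAda,engineer\nGrace,admiral\n"
csv_docs = rows_to_documents(sample)
print(len(csv_docs))
print(csv_docs[0]["page_content"])
```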

In [None]:
### Load PDFs from a directory
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "path/to/directory",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader  # Specify the loader for PDFs
)
docs = loader.load()

print(f"Loaded {len(docs)} PDF pages from directory")

In [None]:
### DirectoryLoader - Load all files from a directory
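from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load all .txt files from a directory.
# DirectoryLoader defaults to UnstructuredFileLoader (needs `pip install unstructured`),
# so pass TextLoader explicitly for plain text files.
loader = DirectoryLoader("path/to/directory", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()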

print(f"Loaded {len(docs)} documents from directory")

## 6. Directory Loaders

Load all files from a directory at once.

In [None]:
### TextLoader - Load .txt files
from langchain_community.document_loaders import TextLoader

loader = TextLoader("path/to/file.txt")
docs = loader.load()

print(f"Content: {docs[0].page_content[:200]}...")

## 5. Text File Loaders

Simple but essential for loading plain text files.

### LinkedIn
LinkedIn is more restrictive and doesn't have an official LangChain loader. Options:

1. **Manual Export**: Go to Settings > Data Privacy > Get a copy of your data
2. **Web Scraping**: Not recommended, as it violates LinkedIn's terms of service
3. **Third-party APIs**: Use services with LinkedIn API access (requires business approval)

In [None]:
### Twitter Loader (requires tweepy library and API keys)
# First install: !pip install tweepy

# from langchain_community.document_loaders import TwitterTweetLoader
# 
# # You need Twitter API credentials
# loader = TwitterTweetLoader.from_bearer_token(
#     oauth2_bearer_token="YOUR_BEARER_TOKEN",
#     twitter_users=["@username"],  # Username to scrape
#     number_tweets=50  # Number of tweets to fetch
# )
# docs = loader.load()

# Note: Twitter API requires approval and has rate limits
print("Twitter loader requires API credentials from developer.twitter.com")

## 4. Social Media Loaders

### Twitter/X
For Twitter, you need API credentials from the Twitter Developer Portal.

In [None]:
### Custom web scraping with BeautifulSoup filters
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load only specific parts of the webpage (e.g., only article content)
loader = WebBaseLoader(
    web_paths=["https://example.com/blog/article"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("article-content", "post-title", "post-header")
        )
    }
)
docs = loader.load()
print("Loaded only specific sections of the page")

In [None]:
### Load multiple URLs at once
from langchain_community.document_loaders import WebBaseLoader

urls = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/get_started/installation/",
]

loader = WebBaseLoader(urls)
docs = loader.load()

print(f"Loaded {len(docs)} documents from {len(urls)} URLs")

In [None]:
### WebBaseLoader - Scrape any website
from langchain_community.document_loaders import WebBaseLoader

# Load a single webpage
loader = WebBaseLoader("https://python.langchain.com/docs/introduction/")
docs = loader.load()

print(f"Loaded {len(docs)} document(s)")
print(f"Content preview: {docs[0].page_content[:300]}...")
print(f"Source URL: {docs[0].metadata['source']}")

## 3. Web Loaders - Scraping Websites

WebBaseLoader is the most common way to scrape web content using BeautifulSoup.

## Quick Comparison Table

| Loader | Free? | JavaScript Support | Use Case | Complexity |
|--------|-------|-------------------|----------|------------|
| **WebBaseLoader** | ✅ Yes | ❌ No | Single pages, blogs | Easy |
| **Unstructured** | ✅ Yes | ❌ No | Complex HTML structure | Medium |
| **RecursiveURL** | ✅ Yes | ❌ No | Entire websites | Medium |
| **Sitemap** | ✅ Yes | ❌ No | Sites with sitemaps | Easy |
| **Spider** | 💰 Paid | ✅ Yes | Production apps | Easy |
| **FireCrawl** | 💰 Paid | ✅ Yes | Self-host option | Medium |
| **Docling** | ✅ Yes | ❌ No | Technical documents | Hard |
| **Hyperbrowser** | 💰 Paid | ✅ Yes | SPAs, complex JS | Medium |
| **AgentQL** | 💰 Paid | ✅ Yes | AI data extraction | Easy |

## Decision Guide

**For Learning/Personal Projects:**
- Start with **WebBaseLoader** - simple and free

**For Production:**
- Simple sites → **WebBaseLoader** or **Sitemap**
- JavaScript sites → **Spider**, **FireCrawl**, or **Hyperbrowser**
- Entire websites → **RecursiveURL** or **Sitemap**
- Specific data extraction → **AgentQL**

**Key Differences:**
1. **Free vs Paid** - APIs cost money but handle complexity
2. **JavaScript** - The free loaders listed here don't execute JS; rendering it needs a headless browser (Selenium, Playwright) or a paid API
3. **Scope** - Single page vs entire website
4. **Data Quality** - APIs return cleaner, LLM-ready data
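
The decision guide can be written down as a small helper. This is only a sketch of the guide above: the function name, the parameter names, and the order in which the questions are asked are assumptions, and the returned strings are just the loader names used in this notebook.

```python
def suggest_loaders(needs_js: bool = False, whole_site: bool = False,
                    has_sitemap: bool = False, field_extraction: bool = False) -> list[str]:
    """Return candidate loaders for a scraping task, following the decision guide."""
    if field_extraction:
        return ["AgentQL"]  # specific data extraction
    if needs_js:
        return ["Spider", "FireCrawl", "Hyperbrowser"]  # JavaScript sites
    if whole_site:
        # Prefer the sitemap when the site publishes one
        return ["Sitemap"] if has_sitemap else ["RecursiveURL"]
    return ["WebBaseLoader"]  # simple single pages


print(suggest_loaders())
print(suggest_loaders(needs_js=True))
print(suggest_loaders(whole_site=True, has_sitemap=True))
```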

In [8]:
### Practical Example: Detecting JS Requirement

# Example 1: Static site - WebBaseLoader works fine
from langchain_community.document_loaders import WebBaseLoader

# Wikipedia is mostly static HTML
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Python_(programming_language)")
docs = loader.load()
print(f"✅ Static site - Got {len(docs[0].page_content)} characters")

# Example 2: JS-heavy site - would need a browser
# Sites like these would return mostly empty with WebBaseLoader:
# - twitter.com (X) - React app
# - instagram.com - React app  
# - medium.com (some pages) - JavaScript loading
# 
# For these, you'd need:
# - Selenium (free but complex)
# - Playwright (free but complex)
# - Spider/Hyperbrowser APIs (paid but easy)
print(docs)

✅ Static site - Got 89411 characters
[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'title': 'Python (programming language) - Wikipedia', 'language': 'en'}, page_content='\n\n\n\nPython (programming language) - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact us\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload fileSpecial pages\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDonate\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\nDonate Create account Log in\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\

## 🔍 Understanding JavaScript Support

### What is JavaScript Support?

**JavaScript support** means the scraper can **execute JavaScript code** that runs on a webpage, not just download the static HTML.

### The Problem:

Websites broadly come in two types:

#### 1️⃣ **Static HTML** (Old-style websites)
```html
<!-- Content is already in the HTML when you download it -->
<html>
  <body>
    <h1>Welcome!</h1>
    <p>This content is visible immediately</p>
  </body>
</html>
```
✅ Simple scrapers work fine (WebBaseLoader, BeautifulSoup)

#### 2️⃣ **JavaScript-Rendered** (Modern websites)
```html
<!-- Initial HTML is almost empty! -->
<html>
  <body>
    <div id="root"></div>
    <script src="app.js"></script>  <!-- Content loaded by JavaScript -->
  </body>
</html>
```
❌ Simple scrapers see an empty page!
✅ Need JavaScript support (browsers, paid APIs)

### Real-World Example:

**Scenario:** Scraping a React-based e-commerce site

**With WebBaseLoader (No JS):**
```html
<!-- You only get the skeleton HTML -->
<div id="root"></div>
<!-- No products, no prices! 😢 -->
```

**With Hyperbrowser (JS Support):**
```html
<!-- The browser runs JavaScript and you get the full content -->
<div id="root">
  <div class="product">iPhone 15 - $999</div>
  <div class="product">MacBook Pro - $2499</div>
  ...
</div>
<!-- All products visible! 🎉 -->
```

### How to Know if a Site Needs JavaScript?

**Test 1: View Page Source**
- Right-click webpage → "View Page Source"
- If you see your content clearly in the HTML → No JS needed ✅
- If you see mostly `<script>` tags and empty `<div>` → Needs JS ❌

**Test 2: Disable JavaScript**
- Open DevTools (F12) → Settings → Disable JavaScript
- Refresh page
- If content disappears → Needs JS support

**Test 3: Common Signs**
- Single Page Applications (SPAs)
- Sites built with React, Vue, Angular
- Content that "loads" after page appears
- Infinite scrolling
- Dynamic filters/searches
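
Test 1 can be roughly automated. The sketch below uses only the standard library to check whether downloaded HTML contains `<script>` tags but almost no visible text. It is a heuristic, not a reliable detector: the function name `probably_needs_js` and the 50-character threshold are assumptions picked for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style> and counts <script> tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.script_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def probably_needs_js(html: str, min_visible_chars: int = 50) -> bool:
    """Guess whether a page relies on JavaScript to show its content."""
    parser = VisibleTextParser()
    parser.feed(html)
    visible = " ".join(parser.text_parts)
    # Scripts present but (almost) nothing to read: likely rendered client-side
    return parser.script_count > 0 and len(visible) < min_visible_chars


static_html = "<html><body><h1>Welcome!</h1><p>This content is visible immediately</p></body></html>"
js_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(probably_needs_js(static_html))  # False
print(probably_needs_js(js_html))      # True
```

Pages that mix server-rendered text with client-side widgets will fool a check like this, so the manual tests are still worth doing.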

### Solutions:

| Website Type | Use This |
|-------------|----------|
| Static HTML (blogs, docs) | WebBaseLoader ✅ Free |
| Simple JS (basic interactions) | WebBaseLoader *might* work |
| Heavy JS (React/Vue apps) | Paid APIs (Spider, Hyperbrowser) 💰 |
| Need browser actions (clicks, logins) | Hyperbrowser 💰 |

In [None]:
### Example: AgentQL
# Install: !pip install langchain-agentql
# from langchain_agentql.document_loaders import AgentQLLoader
# 
# # Use natural language to extract data
# loader = AgentQLLoader(
#     api_key="YOUR_API_KEY",
#     url="https://shop.com",
#     query="Extract all product names and prices"  # Natural language!
# )
# docs = loader.load()

print("AgentQL - AI-powered extraction with natural language queries")

### 9. AgentQL (AI-Powered Extraction) 🤖

**What it does:** Uses AI to extract structured data using natural language queries

**Use Case:**
- When you need specific data fields
- Complex extraction tasks
- Natural language queries instead of CSS selectors

**Pros:**
- Natural language queries ("Get all product prices")
- AI understands page structure
- No need to write complex selectors
- Handles page changes better

**Cons:**
- Requires API key
- AI processing cost
- May be overkill for simple scraping

**When to use:** Complex data extraction, when page structure changes frequently, when you need specific fields extracted intelligently

In [None]:
### Example: Hyperbrowser
# Requires API key from hyperbrowser.ai

# Install: !pip install langchain-hyperbrowser
# from langchain_hyperbrowser import HyperbrowserLoader
# 
# loader = HyperbrowserLoader(
#     api_key="YOUR_API_KEY",
#     urls=["https://spa-app.com"]
# )
# docs = loader.load()

print("Hyperbrowser - Full browser rendering for JavaScript-heavy sites")

### 8. Hyperbrowser (Headless Browser Platform) 🌐

**What it does:** Platform for running headless browsers at scale

**Use Case:**
- JavaScript-heavy single-page applications (SPAs)
- Sites that require browser rendering
- Complex user interactions needed

**Pros:**
- Full browser rendering
- Can handle any JavaScript
- Scalable cloud infrastructure
- Can interact with pages (click, scroll, etc.)

**Cons:**
- Paid service
- Slower than simple HTTP requests
- More expensive

**When to use:** React/Vue/Angular apps, sites requiring login, dynamic content that only loads with JavaScript

In [None]:
### Example: Docling
# Install: !pip install langchain-docling

# from langchain_docling import DoclingLoader
# 
# loader = DoclingLoader("https://example.com/document.html")
# docs = loader.load()

print("Docling - Advanced document understanding from IBM")

### 7. Docling (Document Understanding)

**What it does:** Uses IBM's Docling library for advanced document understanding

**Use Case:**
- Complex documents with mixed content
- Scientific papers, reports
- When document structure matters

**Pros:**
- Advanced document understanding
- Good for PDFs converted to HTML
- Preserves semantic structure

**Cons:**
- Additional dependencies
- More complex setup

**When to use:** Technical documents, research papers, complex formatted content

In [None]:
### Example: FireCrawl
# from langchain_community.document_loaders import FireCrawlLoader
# 
# loader = FireCrawlLoader(
#     api_key="YOUR_API_KEY",
#     url="https://example.com",
#     mode="scrape"
# )
# docs = loader.load()

print("FireCrawl - API service, can be self-hosted for privacy")

### 6. FireCrawl (API - Can Deploy Locally) 🚀

**What it does:** API service for web scraping that can also be self-hosted

**Use Case:**
- Production scraping
- JavaScript-heavy sites
- Can run on your own servers (privacy)

**Pros:**
- Can self-host (own infrastructure)
- Handles JavaScript rendering
- Good for complex modern websites
- Clean data output

**Cons:**
- Paid API (or self-hosting cost)
- Setup required for self-hosting

**When to use:** When you need control over infrastructure, privacy-sensitive data, modern JS sites

In [None]:
### Example: Spider API
# Requires API key from spider.cloud

# from langchain_community.document_loaders import SpiderLoader
# 
# loader = SpiderLoader(
#     api_key="YOUR_API_KEY",
#     url="https://example.com",
#     mode="scrape"  # or "crawl" for multiple pages
# )
# docs = loader.load()

print("Spider - Paid API service for production-grade scraping")

### 5. Spider (API - LLM-Ready Data) 🔥

**What it does:** Cloud API service that crawls and returns cleaned, LLM-ready data

**Use Case:**
- Production applications
- When you need clean, formatted data
- Want to avoid rate limits and blocking

**Pros:**
- Returns pre-cleaned, LLM-ready data
- Handles JavaScript rendering
- No blocking issues
- Professional reliability

**Cons:**
- Requires API key (paid service)
- External dependency
- Cost per request

**When to use:** Production apps, complex sites with JavaScript, when reliability matters

In [None]:
### Example: SitemapLoader
from langchain_community.document_loaders.sitemap import SitemapLoader

# Load all pages from a sitemap
# loader = SitemapLoader(
#     web_path="https://example.com/sitemap.xml",
#     filter_urls=["https://example.com/blog/"]  # Optional: only load blog posts
# )
# docs = loader.load()
# print(f"Loaded {len(docs)} pages from sitemap")

print("SitemapLoader - Uses sitemap.xml for efficient scraping")

### 4. SitemapLoader (Use Site's Sitemap) 🗺️

**What it does:** Reads a website's sitemap.xml file to scrape all listed pages

**Use Case:**
- When site has a sitemap (most professional sites do)
- More efficient than recursive crawling
- Respects site structure

**Pros:**
- Very efficient - site tells you all URLs
- Faster than recursive crawling
- Respects website's intended structure
- Can filter by URL patterns

**Cons:**
- Requires site to have a sitemap
- Only gets pages listed in sitemap

**When to use:** Professional websites, blogs, news sites (most have sitemaps at `/sitemap.xml`)
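
To see why a sitemap makes this efficient, here is a small standard-library sketch that pulls the URL list out of a sitemap and filters it. It is not SitemapLoader's implementation: the function name `urls_from_sitemap` is made up, the sample XML is invented, and the simple prefix filter is an assumption (the real loader's `filter_urls` matching rules may differ).

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefixes: Optional[list] = None) -> list:
    """Return the <loc> URLs in a sitemap, optionally keeping only some prefixes."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    if prefixes:
        urls = [u for u in urls if any(u.startswith(p) for p in prefixes)]
    return urls


sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(sample_sitemap))
print(urls_from_sitemap(sample_sitemap, prefixes=["https://example.com/blog/"]))
```

One request for the sitemap gives the full list of pages, so no link-following is needed before loading them.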

In [None]:
### Example: RecursiveURLLoader
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup

# Scrape a site and all its child pages
# loader = RecursiveUrlLoader(
#     url="https://docs.example.com",
#     max_depth=2,  # How many levels deep to go
#     extractor=lambda html: BeautifulSoup(html, "html.parser").text  # extractor receives the raw HTML string
# )
# docs = loader.load()
# print(f"Scraped {len(docs)} pages recursively")

print("RecursiveURLLoader - Crawls entire website automatically")

### 3. RecursiveURLLoader (Crawl Entire Websites) 🕷️

**What it does:** Starts from a root URL and automatically follows all child links

**Use Case:**
- Scraping entire websites or sections
- Documentation sites with multiple pages
- When you want to capture all linked pages automatically

**Pros:**
- Automatic link discovery
- Can set depth limits
- Can filter URLs with patterns

**Cons:**
- Can scrape too much if not configured properly
- Slower (processes multiple pages)
- Needs careful rate limiting

**When to use:** Complete documentation sites, blogs, knowledge bases
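
The core idea (follow child links, but stop at a depth limit) can be sketched without any network access. This is an illustration of depth-limited crawling in general, not RecursiveUrlLoader's code: the `crawl` function, the in-memory `SITE` link map, and the way depth is counted here (the start page is depth 0) are all assumptions, so the real loader's `max_depth` may count levels differently.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl that visits each URL once, up to max_depth levels."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper
        for link in get_links(url):
            # Stay under the start URL and never revisit a page
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link structure standing in for real pages
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
        "https://other.example.org/",
    ],
    "https://docs.example.com/a": [
        "https://docs.example.com/a/deep",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/a/deep": ["https://docs.example.com/a/deep/deeper"],
}

pages = crawl("https://docs.example.com/", lambda u: SITE.get(u, []), max_depth=2)
print(pages)
```

The external link and the page three levels down are skipped, which is the behaviour the depth limit and URL filtering are meant to give you. A real crawl would also pause between requests to respect rate limits.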

In [None]:
### Example: UnstructuredURLLoader
# Install: !pip install unstructured

# from langchain_community.document_loaders import UnstructuredURLLoader
# 
# loader = UnstructuredURLLoader(urls=["https://example.com"])
# docs = loader.load()
# # Better structure detection for complex pages

print("UnstructuredURLLoader - Better for complex HTML structure")

### 2. UnstructuredURLLoader (Advanced Parsing)

**What it does:** Uses the `Unstructured` library for advanced document parsing

**Use Case:**
- When you need better structure detection (headings, tables, lists)
- Complex HTML with multiple content types
- Better formatting preservation

**Pros:**
- Better at understanding document structure
- Handles tables, lists, and formatting better
- Can preserve document hierarchy

**Cons:**
- Requires additional installation (`pip install unstructured`)
- Slower than WebBaseLoader
- More resource-intensive

**When to use:** Documentation sites, articles with complex formatting, pages with tables

In [None]:
### Example: WebBaseLoader
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
docs = loader.load()
print(f"Loaded: {docs[0].page_content[:100]}...")

### 1. WebBaseLoader (Basic Web Scraper) ⭐ Most Common

**What it does:** Uses `urllib` and `BeautifulSoup` to load and parse HTML

**Use Case:** 
- Simple web scraping of one or more URLs
- When you know the exact URLs you want to scrape
- Basic HTML parsing

**Pros:**
- Easy to use, no API key needed
- Free
- Can filter specific HTML elements with BeautifulSoup

**Cons:**
- Only loads the specific URLs you provide (doesn't follow links)
- No JavaScript rendering
- Can't handle complex modern websites with dynamic content

**When to use:** Single pages, blog posts, documentation pages

## 3.1. Complete Guide to All Web Scraping Loaders

LangChain offers many different web loaders for different use cases. Let's understand each one:

In [None]:
### PDFPlumberLoader - Better for complex PDFs with tables
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("path/to/your/document.pdf")
docs = loader.load()

print(f"Loaded {len(docs)} pages")
# PDFPlumber is better at extracting tables and complex layouts

In [None]:
### PyPDFLoader - Most common PDF loader
from langchain_community.document_loaders import PyPDFLoader

# Load a PDF file
loader = PyPDFLoader("path/to/your/document.pdf")
pages = loader.load()

# Each page is a separate document
print(f"Total pages loaded: {len(pages)}")
print(f"First page content (preview): {pages[0].page_content[:200]}...")
print(f"Metadata: {pages[0].metadata}")

## 2. PDF Loaders

PDFs are one of the most common document formats. LangChain offers multiple PDF loaders.

In [None]:
# Install required packages
# Run this cell first
!pip install langchain langchain-community langchain-text-splitters beautifulsoup4 pypdf pdfplumber python-docx unstructured jq

## 1. Installation

First, let's install the required packages for various loaders.

# LangChain Document Loaders Guide

This notebook covers the most important document loaders in LangChain for various data sources including PDFs, websites, social media, and more.