## 🤔 What is Document Loading?

Before we can ask questions about our documents, we need to **load** them into memory.

```
PDF on Disk  →  Document Loader  →  Text in Python
```

**Why can't we just read the file?**
- PDFs are binary files, not plain text
- They contain formatting, images, fonts, etc.
- We need a special loader to extract the text
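
To see why a plain read is not enough, here is a minimal sketch that builds a tiny PDF-like byte string by hand. It is not a valid PDF (the object layout is simplified for illustration), but it mirrors one common property of real PDFs: page text is often stored in a Flate-compressed stream, so it does not appear in the raw bytes.

```python
import zlib

# Text we pretend is on a PDF page (repeated so compression clearly kicks in).
text = b"Introduction to qualitative research. " * 20

# PDF content streams are commonly stored Flate-compressed (zlib).
compressed = zlib.compress(text)

# Simplified, hand-built stand-in for a PDF file: NOT a valid PDF.
fake_pdf = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

starts_like_pdf = fake_pdf.startswith(b"%PDF-")
text_visible = text in fake_pdf          # is the readable text in the raw bytes?
recovered = zlib.decompress(compressed)  # what an extractor has to do first

print(f"Starts with PDF header: {starts_like_pdf}")
print(f"Text visible in raw bytes: {text_visible}")
print(f"Text recovered after decompression: {recovered == text}")
```

A real loader also has to deal with fonts, layout, and page structure, which is why we hand the job to a dedicated library.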

## Step 1: Import Required Libraries

In [None]:
# Import the PDF loader from LangChain
from langchain_community.document_loaders import PyPDFLoader
import os

print("✅ Libraries imported successfully!")

### 💡 Explanation

- **PyPDFLoader**: A LangChain class that reads PDF files and extracts text
- **os**: Python's standard library module for operating-system tasks; here we use it to build file paths and list folders
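
If the import fails, a package is usually missing. The sketch below checks for the packages this notebook is assumed to need: `langchain_community`, plus `pypdf`, which `PyPDFLoader` relies on to parse PDFs. The helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages this notebook is assumed to need.
required = ["langchain_community", "pypdf"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```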

---

## Step 2: Set Up File Paths

In [None]:
# Find the data folder (one level up from notebooks)
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_folder = os.path.join(project_root, 'data')

print(f"📁 Project root: {project_root}")
print(f"📁 Data folder: {data_folder}")

# List available PDFs (sorted so the "first" PDF is predictable; matches .pdf in any case)
pdf_files = sorted(f for f in os.listdir(data_folder) if f.lower().endswith('.pdf'))
print("\n📄 Available PDFs:")
for pdf in pdf_files:
    print(f"   - {pdf}")

### üí° Explanation

- We use `os.path` to build file paths that work on any operating system
- The `data/` folder contains our PDF documents
- We list all `.pdf` files to see what we have
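
The same listing can also be written with `pathlib`, the object-oriented path module in the standard library. This is only an alternative sketch: the helper `list_pdfs` and the throwaway folder it runs against are invented for the example, so it works without your real `data/` folder.

```python
import tempfile
from pathlib import Path

def list_pdfs(data_folder):
    """Return PDF file names in data_folder, sorted, matching .pdf in any case."""
    folder = Path(data_folder)
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Build a throwaway folder layout so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    for name in ["b.pdf", "a.PDF", "notes.txt"]:
        (data / name).touch()
    found = list_pdfs(data)

print(found)
```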

---

## Step 3: Load a Single PDF

In [None]:
# Load the first PDF
pdf_path = os.path.join(data_folder, pdf_files[0])
print(f"📄 Loading: {pdf_files[0]}")

# Create a loader for this PDF
loader = PyPDFLoader(pdf_path)

# Load the pages
pages = loader.load()

print(f"✅ Loaded {len(pages)} pages!")

### 💡 Explanation

**The loading process:**
```
1. PyPDFLoader(path)  →  Create a loader for the PDF
2. loader.load()      →  Read all pages and extract text
3. Returns: List of Document objects (one per page)
```
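
To make the create-then-load pattern concrete without needing a PDF, here is a toy sketch. `ToyDocument` and `ToyLoader` are hypothetical stand-ins written for this example; they are not LangChain classes and do no real PDF parsing. They only imitate the shape of the result: one document per page.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class ToyLoader:
    """Hypothetical stand-in that mimics the loader -> load() pattern."""

    def __init__(self, source, page_texts):
        self.source = source
        self.page_texts = page_texts  # pretend these were extracted from the PDF

    def load(self):
        # One document per page, with a 0-based page number in the metadata.
        return [
            ToyDocument(text, {"source": self.source, "page": i})
            for i, text in enumerate(self.page_texts)
        ]

toy_loader = ToyLoader("file.pdf", ["First page text", "Second page text"])
toy_pages = toy_loader.load()

print(len(toy_pages))
print(toy_pages[1].metadata)
```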

---

## Step 4: Understand the Document Object

In [None]:
# Look at the first page
first_page = pages[0]

print("📄 Document Object Structure:")
print("="*60)
print(f"Type: {type(first_page).__name__}")
print(f"\n📝 Content (first 500 characters):")
print("-"*60)
print(first_page.page_content[:500])
print("...")
print(f"\n📋 Metadata:")
print("-"*60)
print(first_page.metadata)

### 💡 Explanation

Each **Document** object has two parts:

| Property | Description | Example |
|----------|-------------|--------|
| `page_content` | The actual text from the page | "Introduction to qualitative research..." |
| `metadata` | Information about the source | `{'source': 'file.pdf', 'page': 0}` |

**Why metadata matters:**
- Track which PDF a text came from
- Know which page contains the answer
- Cite sources in your RAG responses
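
As a small illustration of citing sources, the sketch below turns a metadata dictionary into a readable label. The helper `format_citation` is made up for this example, and it assumes the metadata has the `source` and `page` keys shown in the table above, with `page` counted from 0.

```python
import os

def format_citation(metadata):
    """Build a human-readable source label from a page's metadata dict."""
    source = os.path.basename(metadata.get("source", "unknown source"))
    page = metadata.get("page")
    if page is None:
        return source
    # Page numbers in the metadata are assumed 0-based, so add 1 for display.
    return f"{source}, page {page + 1}"

citation = format_citation({"source": "data/file.pdf", "page": 0})
print(citation)
```

With a loaded page you could call `format_citation(first_page.metadata)` in the same way.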

---

## Step 5: Load Multiple PDFs

In [None]:
# Load ALL PDFs from the data folder
all_pages = []

print("📚 Loading all PDFs...\n")

for pdf_name in pdf_files:
    pdf_path = os.path.join(data_folder, pdf_name)
    print(f"   Loading: {pdf_name}")
    
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    
    print(f"   ✅ Loaded {len(pages)} pages")
    all_pages.extend(pages)

print(f"\n{'='*60}")
print(f"🎉 Total pages loaded: {len(all_pages)}")

### 💡 Explanation

We loop through all PDFs and combine their pages:

```
PDF 1 (10 pages)  ─┐
                   ├──►  all_pages (combined list)
PDF 2 (15 pages)  ─┘
```

The `extend()` method adds all pages from each PDF to our master list.
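
A quick sketch of why `extend()` is the right choice here rather than `append()`. The page lists are plain strings standing in for Document objects.

```python
pdf1_pages = ["pdf1-page1", "pdf1-page2"]
pdf2_pages = ["pdf2-page1"]

# extend() adds each page individually, giving one flat list.
flat = []
flat.extend(pdf1_pages)
flat.extend(pdf2_pages)

# append() would add each list as a single item, giving a nested list.
nested = []
nested.append(pdf1_pages)
nested.append(pdf2_pages)

print(len(flat))    # number of pages
print(len(nested))  # number of lists, not pages
```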

---

## 🧪 Try It Yourself

Experiment with the loaded documents:

In [None]:
# Try these experiments!

# 1. How many characters in total?
total_chars = sum(len(page.page_content) for page in all_pages)
print(f"📊 Total characters: {total_chars:,}")

# 2. Average page length?
avg_length = total_chars // len(all_pages)
print(f"📊 Average page length: {avg_length:,} characters")

# 3. Which PDF has the most content?
print("\n📄 Content by source:")
for pdf_name in pdf_files:
    # Compare exact file names so "a.pdf" does not also match "data.pdf"
    pages_from_pdf = [p for p in all_pages if os.path.basename(p.metadata.get('source', '')) == pdf_name]
    chars = sum(len(p.page_content) for p in pages_from_pdf)
    print(f"   {pdf_name}: {len(pages_from_pdf)} pages, {chars:,} chars")
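
As a further experiment, the per-source totals can be collected in a single pass over the pages. This is a sketch under assumptions: `summarize_by_source` is a made-up helper, and the sample pages are simple stand-in objects with the same `page_content` and `metadata` attributes as real Document objects.

```python
import os
from collections import defaultdict
from types import SimpleNamespace

def summarize_by_source(pages):
    """Return {file name: (page count, character count)} in one pass."""
    totals = defaultdict(lambda: [0, 0])
    for page in pages:
        name = os.path.basename(page.metadata.get("source", ""))
        totals[name][0] += 1
        totals[name][1] += len(page.page_content)
    return {name: tuple(counts) for name, counts in totals.items()}

# Stand-in pages so the sketch runs without any PDFs.
sample_pages = [
    SimpleNamespace(page_content="abcde", metadata={"source": "data/a.pdf", "page": 0}),
    SimpleNamespace(page_content="xyz", metadata={"source": "data/a.pdf", "page": 1}),
    SimpleNamespace(page_content="hello world", metadata={"source": "data/b.pdf", "page": 0}),
]

summary = summarize_by_source(sample_pages)
print(summary)
```

In the notebook you could try `summarize_by_source(all_pages)` instead of the sample list.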

---

## ✅ Summary

In this notebook, you learned:

1. **What document loading is** - Converting PDFs into text that Python can use
2. **How to use PyPDFLoader** - LangChain's PDF reader
3. **The Document object** - Contains `page_content` and `metadata`
4. **Loading multiple files** - Combine all documents into one list

## ➡️ Next Step

In **Notebook 2: Text Chunking**, you'll learn why we need to split these pages into smaller pieces.

---

**Variables to use in next notebook:**
- `all_pages` - List of all Document objects from your PDFs