# SESSION 10 : Document Loaders in LangChain | Generative AI using LangChain | Video 10

https://youtu.be/bL92ALSZ2Cg?list=PLKnIA16_RmvaTbihpo4MtzVm4XOQa0ER0

![Screenshot%202025-08-24%20202002.png](attachment:Screenshot%202025-08-24%20202002.png)

### RAG 

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval with language generation: a model retrieves relevant documents from a knowledge base and then uses them as context to generate accurate, grounded responses.
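A toy sketch of the retrieve-then-generate flow described above, using only the standard library. Word overlap stands in for embedding similarity, the names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical, and the final call to a language model is left out.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use a vector store.
KNOWLEDGE_BASE = [
    "LangChain document loaders return Document objects.",
    "RAG retrieves relevant documents before generating an answer.",
    "Bananas are rich in potassium.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for embeddings)."""
    q = tokenize(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Place the retrieved documents in the prompt as context for the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What do document loaders return?")
print(prompt)
```

The prompt produced here is what would be sent to the model, so the answer is grounded in the retrieved text rather than in the model's training data alone.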

#### Benefits of using RAG 

1. Use of up-to-date information 


2. Better privacy 


3. No limit on document size

![Screenshot%202025-08-24%20202046.png](attachment:Screenshot%202025-08-24%20202046.png)

## Document Loader

![Screenshot%202025-08-24%20202134.png](attachment:Screenshot%202025-08-24%20202134.png)


### Document Loaders :

* **Document Loaders** are components in LangChain used to __load raw data__ from various sources (PDFs, Word files, websites, databases, etc.) into a standardized format (usually `Document` objects), which can then be used for chunking, embedding, retrieval, and generation.


* They return data in a standard format: a list of **`Document` objects** (`page_content` + `metadata`).


* Used as the **first step** in RAG pipelines (before splitting, embedding, and retrieval).


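```python
# PyPDFLoader relies on the `pypdf` package being installed.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")  # path to a local PDF
docs = loader.load()                 # one Document per page

# Inspect the first page: its text and its metadata (source path, page number).
print(docs[0].page_content, docs[0].metadata)
```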

![image.png](attachment:image.png)

### 🔹 Most Widely Used Document Loaders

1. **File-based**

   * `PyPDFLoader` → PDFs
   * `Docx2txtLoader` → Word docs
   * `UnstructuredFileLoader` → Word, PPT, HTML, TXT, etc.
   * `CSVLoader` → CSV files

2. **Web-based**

   * `WebBaseLoader` → Webpages
   * `SitemapLoader` → Websites via sitemap

3. **Cloud/DB-based**

   * `AirbyteJSONLoader`, `DataFrameLoader` → DB / structured data
   * `GoogleDriveLoader`, `S3FileLoader` → Cloud storage

---

✅ **In short**: Document Loaders = **data ingestion tools** in LangChain.


They pull content from files, web, or databases → return standard `Document` objects → ready for text splitting and embedding.

### 1. Text Loader

__TextLoader__ is a simple and commonly used document loader in LangChain that reads __plain text (.txt) files__ and converts them into LangChain Document objects. 


#### Use Case 


- Ideal for loading chat logs, scraped text, transcripts, code snippets, or any plain text data into a LangChain pipeline. 

#### Limitation 

- Works only with plain-text files (such as .txt); it cannot parse formats like PDF or Word.
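With LangChain installed, the usual call is `TextLoader(path, encoding="utf-8").load()` from `langchain_community.document_loaders`. The block below is a standard-library sketch of the same idea, not LangChain's implementation: `Doc` is a hypothetical stand-in for the `Document` class and `load_text` is an illustrative helper.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Doc:
    """Minimal stand-in for LangChain's Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text(path: str, encoding: str = "utf-8") -> list[Doc]:
    """Read the whole file into one Doc and record where it came from."""
    text = Path(path).read_text(encoding=encoding)
    return [Doc(page_content=text, metadata={"source": path})]

# Demo on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Cricket is played with a bat and a ball.", encoding="utf-8")
    docs = load_text(str(sample))

print(len(docs), docs[0].page_content)
```

The whole file becomes a single document, which is why a text splitter normally follows this step in a RAG pipeline.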

### 2. PyPDFLoader

PyPDFLoader is a document loader in LangChain used to load content from PDF files and convert each page into a Document object.

###### For each page we get a separate Document object

![image.png](attachment:image.png)

#### Limitations: 

- It uses the `pypdf` library under the hood, so it does not work well with scanned PDFs or complex layouts.

### Other PDF loaders: https://python.langchain.com/docs/concepts/document_loaders/

![image.png](attachment:image.png)

### Directory Loader

DirectoryLoader is a document loader that lets you __load multiple documents from a directory (folder)__ of files.

![image.png](attachment:image.png)
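In LangChain this is typically written as `DirectoryLoader(path="books", glob="*.pdf", loader_cls=PyPDFLoader)`, where `"books"` is just an example folder name. As a rough standard-library sketch of what happens underneath (match files with a glob pattern, then run one per-file loader on each match), assuming plain-text files and a hypothetical `load_directory` helper:

```python
import tempfile
from pathlib import Path

def load_directory(path: str, glob: str = "*.txt") -> list[dict]:
    """Apply one per-file loader to every file in `path` that matches `glob`."""
    docs = []
    for file in sorted(Path(path).glob(glob)):
        docs.append({
            "page_content": file.read_text(encoding="utf-8"),
            "metadata": {"source": file.name},
        })
    return docs

# Build a small folder on the fly; the .md file should be skipped by the pattern.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "ignore.md"):
        (Path(tmp) / name).write_text(f"contents of {name}", encoding="utf-8")
    docs = load_directory(tmp)

print([d["metadata"]["source"] for d in docs])
```

Only the files matching the glob pattern are loaded, and each document keeps its source file in the metadata.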

### Load vs Lazy Load

![image.png](attachment:image.png)
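In LangChain, `load()` returns a list with every `Document` already in memory, while `lazy_load()` returns a generator that yields documents one at a time. The sketch below imitates that contract with a hypothetical `ToyLoader` over in-memory "files" (no LangChain required) and counts how many files each approach has read.

```python
from typing import Iterator

class ToyLoader:
    """Hypothetical loader over in-memory 'files'; tracks how many were read."""

    def __init__(self, files: dict[str, str]):
        self.files = files
        self.reads = 0

    def lazy_load(self) -> Iterator[dict]:
        """Yield one document at a time; nothing is read until iteration starts."""
        for name, text in self.files.items():
            self.reads += 1
            yield {"page_content": text, "metadata": {"source": name}}

    def load(self) -> list[dict]:
        """Read everything up front and return the full list."""
        return list(self.lazy_load())

files = {f"file{i}.txt": f"text {i}" for i in range(1000)}

eager = ToyLoader(files)
all_docs = eager.load()
print(len(all_docs), eager.reads)  # every file is read before load() returns

lazy = ToyLoader(files)
first = next(lazy.lazy_load())
print(first["metadata"]["source"], lazy.reads)  # only one file read so far
```

With many or very large files, the lazy version lets each document be processed and discarded before the next one is read.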

### 3. Web Base Loader

WebBaseLoader is a document loader in LangChain used to load and extract text content from web pages (URLs).

It uses BeautifulSoup under the hood to parse HTML and extract visible text.

#### When to Use:


- For blogs, news articles, or public websites where the content is primarily text-based and static

#### Limitations:


- Doesn’t handle JavaScript-heavy pages well (use SeleniumURLLoader for that). 



- Loads only static content (what's in the HTML, not what loads after the page renders).
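The static-content limitation can be illustrated without a network call. The sketch below uses Python's built-in `html.parser` as a stand-in for BeautifulSoup (a swapped-in technique, not what `WebBaseLoader` runs) and assumes that "visible text" means everything outside `<script>` and `<style>` tags. The sample page and the `VisibleText` class are hypothetical.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Static HTML as the server sends it; the price is filled in later by JavaScript.
html = """
<html><body>
  <h1>Laptop X</h1>
  <p id="price"></p>
  <script>document.getElementById("price").textContent = "$999";</script>
</body></html>
"""

parser = VisibleText()
parser.feed(html)
text = " ".join(parser.parts)
print(text)
```

The price never shows up in the extracted text because it only exists after a browser runs the script, which is the case where a browser-driven loader such as `SeleniumURLLoader` is needed.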

### 4. CSV Loader

CSVLoader is a document loader used to load CSV files into LangChain Document objects, one per row by default.
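With LangChain installed, the call is `CSVLoader(file_path="data.csv").load()`, where `data.csv` is a placeholder file name. The standard-library sketch below imitates the one-document-per-row behaviour, with each row rendered as `column: value` lines; `load_csv` and the sample data are hypothetical, and this is not LangChain's own code.

```python
import csv
import io

def load_csv(text: str, source: str = "data.csv") -> list[dict]:
    """Turn each CSV row into one document of 'column: value' lines."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs

sample = "name,age\nAsha,31\nRavi,27\n"
docs = load_csv(sample)

print(len(docs))
print(docs[1]["page_content"])
```

Because every row is its own document, the row index in the metadata makes it possible to trace a retrieved chunk back to the exact line in the file.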

### Other document loaders in LangChain

### Custom Document Loader

https://python.langchain.com/docs/how_to/document_loader_custom/