<a href="https://colab.research.google.com/github/nsk-ai/RAG-Bootcamp-2025/blob/main/Document_Loaders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://python.langchain.com/docs/integrations/document_loaders/

---
## LangChain Document Loaders

LangChain Document Loaders are components that ingest raw data from various sources and transform it into a **standardized format**: `Document` objects. Each `Document` typically consists of **`page_content`** (the actual text data) and **`metadata`** (information about the text, such as source, page number, or creation date).
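
As a rough illustration of that two-field shape, the sketch below uses a plain dataclass. `SketchDocument` is a hypothetical stand-in written for this example, not LangChain's own class (the real one is importable as `langchain_core.documents.Document`), and the sample text and path are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in that mirrors the two fields described above."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Sample values are invented for illustration only.
doc = SketchDocument(
    page_content="LangChain is a framework for LLM applications.",
    metadata={"source": "./notes/intro.txt", "page": 0},
)

print(doc.page_content)
print(doc.metadata["source"])
```

Keeping the text and its provenance together in one object is what lets later stages (splitting, embedding, retrieval) carry the source information along with each chunk.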

The primary purpose of document loaders is to facilitate the use of diverse data in LangChain applications, particularly for tasks like **Retrieval-Augmented Generation (RAG)**. LangChain offers a wide range of document loaders to handle various data sources, including:

### File Loaders
For loading data from local files, such as:
* **Text files** (`TextLoader`)
* **PDF files** (`PyPDFLoader`)
* **CSV files** (`CSVLoader`)
* **JSON files** (`JSONLoader`)
* **Markdown files** (`UnstructuredMarkdownLoader`)
* **Code files** (e.g., Python, Java, C++), using language-specific loaders or the generic `UnstructuredFileLoader`

### Web Loaders
For loading data from remote sources, such as:
* **Web pages** (`WebBaseLoader`, `UnstructuredHTMLLoader`)
* **YouTube videos** (`YoutubeLoader`)
* **Notion databases** (`NotionDBLoader`)
* **Google Drive files** (`GoogleDriveLoader`)
* **Slack messages** (`SlackDataLoader`)

### Database Loaders
For loading data from various databases.

---

All document loaders in LangChain implement the **`BaseLoader` interface**. They typically provide `.load()`, which reads everything and returns a list of `Document` objects, and `.lazy_load()`, which yields `Document` objects one at a time and is therefore more memory-efficient for large datasets. This standardization allows for consistent data handling across different sources within LangChain applications.

---
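
The relationship between the two methods can be sketched without LangChain at all. `LineLoaderSketch` below is a hypothetical class, not part of the library, and it returns plain dictionaries rather than real `Document` objects. It only illustrates the eager versus lazy pattern mentioned above, under the assumption that a loader produces one record per non-empty line.

```python
from typing import Iterator, List


class LineLoaderSketch:
    """Hypothetical loader: one record per non-empty line of text."""

    def __init__(self, text: str, source: str):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[dict]:
        # Yield records one at a time, so nothing is built up in memory.
        for line_number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield {
                    "page_content": line.strip(),
                    "metadata": {"source": self.source, "line": line_number},
                }

    def load(self) -> List[dict]:
        # Eager variant: materialize everything the lazy variant would yield.
        return list(self.lazy_load())


sketch_loader = LineLoaderSketch("first line\n\nsecond line\n", source="notes.txt")

all_records = sketch_loader.load()               # whole list at once
first_record = next(sketch_loader.lazy_load())   # only the first record is produced

print(len(all_records))
print(first_record["page_content"])
```

With a large source, iterating over `lazy_load()` lets each record be processed and discarded before the next one is read.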

In [None]:
%pip install -qU langchain



In [None]:
%pip install -qU langchain-groq

In [None]:
from google.colab import userdata
import os

# Read the Groq API key from Colab's secret store (the "Secrets" panel)
groq_api_key = userdata.get('GROQ_API_KEY')

# Expose it as an environment variable so code that reads os.environ can find it
os.environ['GROQ_API_KEY'] = groq_api_key

In [None]:
%pip install -qU langchain_community pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./sample_data/langchain.pdf"
loader = PyPDFLoader(file_path)

In [None]:
docs = loader.load()
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-11-06T10:08:55+00:00', 'author': '', 'keywords': '', 'moddate': '2024-11-06T10:08:55+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'subject': '', 'title': '', 'trapped': '/False', 'source': './sample_data/langchain.pdf', 'total_pages': 14, 'page': 0, 'page_label': '1'}, page_content='LangChain\nVasilios Mavroudis\nAlan Turing Institute\nvmavroudis@turing.ac.uk\nAbstract. LangChain is a rapidly emerging framework that offers a ver-\nsatile and modular approach to developing applications powered by large\nlanguage models (LLMs). By leveraging LangChain, developers can sim-\nplify complex stages of the application lifecycle—such as development,\nproductionization, and deployment—making it easier to build scalable,\nstateful, and contextually aware applications. It provides tools for han-\ndling chat models, integrat

In [None]:
import pprint

pprint.pp(docs[0].metadata)

{'producer': 'pdfTeX-1.40.26',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-11-06T10:08:55+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2024-11-06T10:08:55+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live '
                    '2024) kpathsea version 6.4.0',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': './sample_data/langchain.pdf',
 'total_pages': 14,
 'page': 0,
 'page_label': '1'}
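
One common use of this metadata is to cite where a piece of text came from. The helper below is a small sketch, not part of LangChain: `format_citation` is a hypothetical name, and it assumes (based on the output above, where `page` is 0 and `page_label` is '1') that `page` is a 0-based index while `page_label` holds the printed page number.

```python
def format_citation(metadata: dict) -> str:
    """Build a short 'source, p. N' string from loader metadata."""
    source = metadata.get("source", "unknown source")
    # Prefer the printed page label; otherwise convert the 0-based index.
    if "page_label" in metadata:
        page = metadata["page_label"]
    elif "page" in metadata:
        page = str(metadata["page"] + 1)
    else:
        return source
    return f"{source}, p. {page}"


# A subset of the metadata printed above.
sample_metadata = {
    "source": "./sample_data/langchain.pdf",
    "total_pages": 14,
    "page": 0,
    "page_label": "1",
}

print(format_citation(sample_metadata))
```

In the notebook, the same helper could be called as `format_citation(docs[0].metadata)`.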


In [None]:
docs[2].page_content[:100]

'LangChain 3\nneeds, providing a flexible foundation for building scalable, secure, and multi-\nfunctio'
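
The extracted text keeps the PDF's line breaks, including words split by end-of-line hyphens (such as `ver-\nsatile` on the first page). A possible cleanup step is sketched below. `join_hyphenated_line_breaks` is a hypothetical helper, not a LangChain function, and it is deliberately naive: it would also merge a genuine compound such as `well-\nknown` into `wellknown`.

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Rejoin words split across lines, then turn remaining newlines into spaces."""
    # A hyphen directly before a newline, with word characters on both sides,
    # is treated as a layout artifact and removed together with the newline.
    rejoined = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return rejoined.replace("\n", " ")


# Fragment taken from the first page's page_content shown earlier.
sample = (
    "offers a ver-\nsatile and modular approach to developing "
    "applications powered by large\nlanguage models"
)

cleaned = join_hyphenated_line_breaks(sample)
print(cleaned)
```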

In [None]:
print(len(docs))

14


## Web Loaders

In [None]:
%pip install -qU langchain_community beautifulsoup4

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.nskai.org/")



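To work around SSL verification errors during fetching, you can disable certificate verification by setting the `verify` option in the loader's `requests_kwargs` (this skips certificate checks, so use it only with sites you trust):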



In [None]:
loader.requests_kwargs = {'verify': False}

In [None]:
docs = loader.load()
print(len(docs))

1
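
The loader returns a single document for the page. As a rough, standard-library-only illustration of the idea (not how `WebBaseLoader` is implemented; it relies on BeautifulSoup, which is why `beautifulsoup4` is installed above), the sketch below extracts visible text from an HTML string and pairs it with a `source` entry. `VisibleTextParser`, `html_to_record`, the sample HTML, and the example URL are all hypothetical, and the fetching step is omitted.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_record(html: str, source_url: str) -> dict:
    """Turn one HTML page into a single text record with its source."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.chunks),
        "metadata": {"source": source_url},
    }


sample_html = (
    "<html><head><title>Demo</title><style>p {color: red;}</style></head>"
    "<body><h1>Hello</h1><p>Loaders turn pages into text.</p></body></html>"
)

record = html_to_record(sample_html, "https://example.com/")
print(record["page_content"])
print(record["metadata"])
```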


