<a href="https://colab.research.google.com/github/kalai2315/RAG_System_Essentials/blob/main/LangchainDocument_Loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Document Loaders in LangChain

In [None]:

%pip install -U --quiet langchain langchain-google-genai langchain-community

In [None]:
!pip install "unstructured[all-docs]==0.14.0"



In [None]:
# install OCR dependencies for unstructured
!sudo apt-get install tesseract-ocr
!sudo apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 186 kB of archives.
After this operation, 697 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.8 [186 kB]
Fetched 186 kB in 0s (423 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
de

In [None]:
!pip install jq==1.7.0
!pip install pypdf==4.2.0
!pip install pymupdf==1.24.4

Collecting jq==1.7.0
  Downloading jq-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Downloading jq-1.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (668 kB)
Installing collected packages: jq
Successfully installed jq-1.7.0
Collecting pypdf==4.2.0
  Downloading pypdf-4.2.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
  Attempting uninstall: pypdf
    Found existing install

# **Document Loaders**

Document loaders are used to import data from various sources into LangChain as Document objects. A Document typically includes a piece of text along with its associated metadata.

# Examples of Document Loaders:

Text File Loader: Loads data from a simple .txt file.

Web Page Loader: Retrieves the text content from any web page.

YouTube Video Transcript Loader: Loads transcripts from YouTube videos.

Functionality:

Load Method: Each document loader has a `load()` method that reads its pre-configured source and returns the data as a list of Document objects.

Lazy Load Option: Some loaders also implement `lazy_load()`, which yields Document objects one at a time so that a large source does not have to be held in memory all at once.

For more detailed information, visit LangChain's document loader documentation.
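The difference between eager and lazy loading can be illustrated without LangChain at all. The sketch below is a simplified stand-in, not LangChain's implementation: `SimpleDocument` and `LineLoader` are hypothetical names that only mimic the `page_content`/`metadata` shape of a Document and the `load()`/`lazy_load()` pair.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each non-empty line of a file into a document."""

    def __init__(self, path: str):
        self.path = path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield one document at a time; only the current line is held in memory.
        with open(self.path, encoding="utf-8") as fh:
            for line_no, line in enumerate(fh):
                if line.strip():
                    yield SimpleDocument(line.strip(), {"source": self.path, "line": line_no})

    def load(self) -> list:
        # Eager loading is the lazy iterator collected into a list.
        return list(self.lazy_load())


# Write a small demo file so the example runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\n\nsecond line\n")
    demo_path = tmp.name

loader = LineLoader(demo_path)
all_docs = loader.load()           # every document at once

lazy_iter = loader.lazy_load()
first_doc = next(lazy_iter)        # only the first document has been produced
lazy_iter.close()                  # release the file handle held by the generator
os.remove(demo_path)
```

Writing `load()` in terms of `lazy_load()` keeps a single code path for parsing; callers choose between the two depending on whether the whole corpus fits in memory.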

In [None]:
!curl -o README.md https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5169  100  5169    0     0  23203      0 --:--:-- --:--:-- --:--:-- 23283


In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./README.md")
doc = loader.load()

In [None]:
len(doc)

1

In [None]:
type(doc[0])

In [None]:
print(doc[0].page_content[:100])

<picture>
  <source media="(prefers-color-scheme: light)" srcset="docs/static/img/logo-dark.svg">
  



# **Markdown Loader**

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

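This section shows how to load Markdown files as LangChain Document objects that we can use in our pipelines and chains. With `mode='single'` the loader returns the whole file as one Document; with `mode='elements'` it returns one Document per parsed element (title, paragraph, list item), each tagged with a `category` in its metadata.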

In [None]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode='single')
docs = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
len(docs)

1

In [None]:
type(docs[0])

In [None]:
print(docs[0].page_content[:100])

[!NOTE]
Looking for the JS/TS library? Check out LangChain.js.

LangChain is a framework for buildin


In [None]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode="elements")
docs = loader.load()

In [None]:
len(docs)

18

In [None]:
docs[:10]

[Document(metadata={'source': './README.md', 'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': 'ada4984f55e3bfe7057f8abd1b24a809'}, page_content='[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.'),
 Document(metadata={'source': './README.md', 'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md', 'category': 'NarrativeText', 'element_id': 'd9dd2676fd3ef14eba932c9da5c5636b'}, page_content='LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.'),
 Document(metadata={'source': './README.md', 'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'f

In [None]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

Counter({'NarrativeText': 7, 'Title': 4, 'ListItem': 7})
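Because every element carries a `category`, the flat list can be regrouped into sections. The following is a minimal sketch that uses plain `(category, text)` tuples as stand-ins for the loaded documents; `group_by_title` is a hypothetical helper, not part of LangChain or Unstructured, and the sample data is made up for illustration.

```python
def group_by_title(elements):
    """Group element texts under the most recent 'Title' element.

    `elements` is a list of (category, text) tuples. Text that appears
    before any title is collected under the key None.
    """
    sections = {}
    current_title = None
    for category, text in elements:
        if category == "Title":
            current_title = text
            sections.setdefault(current_title, [])
        else:
            sections.setdefault(current_title, []).append(text)
    return sections


# Illustrative data only. With real loader output, the tuples could be built as
# [(d.metadata["category"], d.page_content) for d in docs].
sample_elements = [
    ("NarrativeText", "Intro paragraph."),
    ("Title", "Section A"),
    ("NarrativeText", "Body of section A."),
    ("Title", "Section B"),
    ("ListItem", "First point"),
    ("ListItem", "Second point"),
]
sections = group_by_title(sample_elements)
```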

In [None]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="./README.md")

In [None]:
len(docs)

18

In [None]:
docs[:10]

[<unstructured.documents.elements.NarrativeText at 0x7a1c94e026d0>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c99fbe790>,
 <unstructured.documents.elements.Title at 0x7a1c99fbe3d0>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c99fbf690>,
 <unstructured.documents.elements.Title at 0x7a1c99fbce90>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c99fbf3d0>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c99fbca90>,
 <unstructured.documents.elements.Title at 0x7a1c9b426890>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c984ce5d0>,
 <unstructured.documents.elements.NarrativeText at 0x7a1c984cd8d0>]

In [None]:
docs[0].to_dict()

{'type': 'NarrativeText',
 'element_id': 'ada4984f55e3bfe7057f8abd1b24a809',
 'text': '[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.',
 'metadata': {'last_modified': '2025-06-28T15:57:20',
  'languages': ['eng'],
  'filetype': 'text/markdown',
  'file_directory': '.',
  'filename': 'README.md'}}

In [None]:
docs[1].to_dict()

{'type': 'NarrativeText',
 'element_id': 'd9dd2676fd3ef14eba932c9da5c5636b',
 'text': 'LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.',
 'metadata': {'last_modified': '2025-06-28T15:57:20',
  'languages': ['eng'],
  'filetype': 'text/markdown',
  'file_directory': '.',
  'filename': 'README.md'}}

In [None]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in docs]
lc_docs[:10]

[Document(metadata={'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='[!NOTE]\nLooking for the JS/TS library? Check out LangChain.js.'),
 Document(metadata={'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='LangChain is a framework for building LLM-powered applications. It helps you chain\ntogether interoperable components and third-party integrations to simplify AI\napplication development —  all while future-proofing decisions as the underlying\ntechnology evolves.'),
 Document(metadata={'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '.', 'filename': 'README.md'}, page_content='bash\npip install -U langchain'),
 Document(metadata={'last_modified': '2025-06-28T15:57:20', 'languages': ['eng'], 'parent_id': 'd096b8fd4a734

# **CSV Loader**

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of Document objects. Each row of the CSV file is converted to one document.
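The row-to-document conversion can be approximated with the standard library alone. This sketch mirrors the `column: value` layout and the `source`/`row` metadata that CSVLoader produces, but it is an illustration rather than CSVLoader's actual code; `rows_to_documents` is a hypothetical helper and plain dicts stand in for Document objects.

```python
import csv
import io


def rows_to_documents(csv_text, source="data.csv"):
    """Convert each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" pair per line, in column order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


csv_text = "Property_ID,City,Bedrooms\n101,Springfield,3\n102,Rivertown,2\n"
documents = rows_to_documents(csv_text)
print(documents[0]["page_content"])
```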

In [None]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="./data.csv")
docs = loader.load()

In [None]:
docs

[Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000'),
 Document(metadata={'source': './data.csv', 'row': 1}, page_content='Property_ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip_Code: 87654\nBedrooms: 2\nBathrooms: 1\nListing_Price: 350000'),
 Document(metadata={'source': './data.csv', 'row': 2}, page_content='Property_ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip_Code: 76543\nBedrooms: 4\nBathrooms: 3\nListing_Price: 600000'),
 Document(metadata={'source': './data.csv', 'row': 3}, page_content='Property_ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip_Code: 65432\nBedrooms: 3\nBathrooms: 2\nListing_Price: 475000'),
 Document(metadata={'source': './data.csv', 'row': 4}, page_content='Property_ID: 105\nAddress: 654 Cedar St\nCity: Sunnyvale\nState: CO\nZip_Code: 54321\nBedrooms: 5\nBathrooms

In [None]:
docs[0]

Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property_ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip_Code: 98765\nBedrooms: 3\nBathrooms: 2\nListing_Price: 500000')

In [None]:
print(docs[0].page_content)

Property_ID: 101
Address: 123 Elm St
City: Springfield
State: CA
Zip_Code: 98765
Bedrooms: 3
Bathrooms: 2
Listing_Price: 500000


CSVLoader accepts a `csv_args` keyword argument that customizes the arguments passed to Python's `csv.DictReader`. See the csv module documentation for more information on which csv args are supported.
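One caveat is worth checking when passing `fieldnames`: `csv.DictReader` then treats the first line of the file as data rather than as a header, so a file that already has a header row yields that row as its first record. The stdlib-only sketch below demonstrates this with a small in-memory CSV (the sample text is made up for illustration).

```python
import csv
import io

csv_text = "Property_ID,City\n101,Springfield\n102,Rivertown\n"

# Without fieldnames: the first line is consumed as the header.
default_rows = list(csv.DictReader(io.StringIO(csv_text)))

# With fieldnames: the first line is returned as an ordinary record.
renamed_rows = list(
    csv.DictReader(io.StringIO(csv_text), fieldnames=["Property ID", "City"])
)

# Skipping the original header manually before reading the data rows.
stream = io.StringIO(csv_text)
next(stream)  # discard the header line
skipped_rows = list(csv.DictReader(stream, fieldnames=["Property ID", "City"]))
```

CSVLoader takes a file path rather than a stream, so one possible workaround there is to drop the first Document after loading (for example with `docs[1:]`).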

In [None]:
loader = CSVLoader(file_path="./data.csv",
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )
docs = loader.load()

In [None]:
docs

[Document(metadata={'source': './data.csv', 'row': 0}, page_content='Property ID: Property_ID\nAddress: Address\nCity: City\nState: State\nZip Code: Zip_Code\nBedrooms: Bedrooms\nBathrooms: Bathrooms\nPrice: Listing_Price'),
 Document(metadata={'source': './data.csv', 'row': 1}, page_content='Property ID: 101\nAddress: 123 Elm St\nCity: Springfield\nState: CA\nZip Code: 98765\nBedrooms: 3\nBathrooms: 2\nPrice: 500000'),
 Document(metadata={'source': './data.csv', 'row': 2}, page_content='Property ID: 102\nAddress: 456 Oak St\nCity: Rivertown\nState: TX\nZip Code: 87654\nBedrooms: 2\nBathrooms: 1\nPrice: 350000'),
 Document(metadata={'source': './data.csv', 'row': 3}, page_content='Property ID: 103\nAddress: 789 Pine St\nCity: Laketown\nState: FL\nZip Code: 76543\nBedrooms: 4\nBathrooms: 3\nPrice: 600000'),
 Document(metadata={'source': './data.csv', 'row': 4}, page_content='Property ID: 104\nAddress: 321 Maple St\nCity: Hillside\nState: NY\nZip Code: 65432\nBedrooms: 3\nBathrooms: 2\nP

In contrast, UnstructuredCSVLoader (backed by Unstructured.io) loads the entire CSV as a single table in one Document.

In [None]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("./data.csv")
docs = loader.load()

In [None]:
len(docs)

1

In [None]:
docs[0]

Document(metadata={'source': './data.csv'}, page_content='\n\n\nProperty_ID\nAddress\nCity\nState\nZip_Code\nBedrooms\nBathrooms\nListing_Price\n\n\n101\n123 Elm St\nSpringfield\nCA\n98765\n3\n2\n500000\n\n\n102\n456 Oak St\nRivertown\nTX\n87654\n2\n1\n350000\n\n\n103\n789 Pine St\nLaketown\nFL\n76543\n4\n3\n600000\n\n\n104\n321 Maple St\nHillside\nNY\n65432\n3\n2\n475000\n\n\n105\n654 Cedar St\nSunnyvale\nCO\n54321\n5\n4\n750000\n\n\n')

# **JSON Loader**

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

JSON Lines is a file format where each line is a valid JSON value.

LangChain implements a JSONLoader to convert JSON and JSONL data into LangChain Document objects. It uses a specified jq schema to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It uses the jq Python package. Check out the jq manual for detailed documentation of the jq syntax.
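For readers new to jq, the schemas used in this section can be read as ordinary Python traversals. The sketch below uses a small hypothetical payload and plain Python rather than the jq package, so it only approximates jq semantics (in jq, a missing key evaluates to `null`, shown here as `None`).

```python
payload = {
    "title": "demo chat",
    "messages": [
        {"content": "Hello", "sender_name": "User A"},
        {"photos": [{"uri": "item.jpg"}], "sender_name": "User B"},  # no "content" key
        {"content": "Bye", "sender_name": "User B"},
    ],
}

# jq '.' selects the whole object, i.e. a single result.
whole = [payload]

# jq '.messages[]' emits one result per element of the "messages" array.
per_message = list(payload["messages"])

# jq '.messages[].content' emits the "content" of each message;
# a message without that key contributes null (None here).
contents = [message.get("content") for message in payload["messages"]]
```

Each emitted result becomes one Document, which is why the choice of schema controls how many documents the loader returns.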

In [None]:
import json

# Sample chat export used to demonstrate JSONLoader
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the modified data to a JSON file
with open('chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)

To load the full data as a single document, use the jq schema `.`:

In [None]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.',
    text_content=False)

data = loader.load()

In [None]:
len(data)

1

In [None]:
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='{"image": {"creation_timestamp": 1675549016, "uri": "image_of_the_meeting.jpg"}, "is_still_participant": true, "joinable_mode": {"link": "", "mode": 1}, "magic_words": [], "messages": [{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}, {"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}, {"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}, {"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}, {"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}, {"content": "Here\\u2019s the $150 for it.", "sender_name": "User B", "timestamp_ms": 1675595068468}, {"photos": [{"creation_timestamp": 1675595059, "uri": "image_of_the_item.jpg"}],

Suppose we are interested in extracting the values under the messages key of the JSON data. The jq schema `.messages[]` produces one Document per message.

In [None]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='{"content": "See you soon!", "sender_name": "User B", "timestamp_ms": 1675597571851}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 2}, page_content='{"content": "Thanks for the update! See you then.", "sender_name": "User A", "timestamp_ms": 1675597435669}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 3}, page_content='{"content": "Actually, the green one is sold out.", "sender_name": "User B", "timestamp_ms": 1675596277579}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 4}, page_content='{"content": "I was hoping to purchase the green one!", "sender_name": "User A", "timestamp_ms": 1675595140251}'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 5}, page_content='{"content": "I\\u2019m really interested in the green one, not the red!", "sender_name": "User A", "timestamp_ms": 1675595109305}'),
 Document(met

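Suppose we are interested in extracting only the values of the content field within the messages key of the JSON data. A message without a content field, such as the photo message, yields a Document with an empty page_content (seq_num 7 below).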

In [None]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data

[Document(metadata={'source': '/content/chat_data.json', 'seq_num': 1}, page_content='See you soon!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 2}, page_content='Thanks for the update! See you then.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 3}, page_content='Actually, the green one is sold out.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 4}, page_content='I was hoping to purchase the green one!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 5}, page_content='I’m really interested in the green one, not the red!'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 6}, page_content='Here’s the $150 for it.'),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 7}, page_content=''),
 Document(metadata={'source': '/content/chat_data.json', 'seq_num': 8}, page_content='It typically sells for at least $200 online'),
 Document(metadata={'source': '/conten

# **PDF Loaders**

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is best found through experimentation.

Here we will see how to load PDF documents into the LangChain Document format.

In [None]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

--2025-06-28 16:16:36--  http://arxiv.org/pdf/2103.15348.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2103.15348 [following]
--2025-06-28 16:16:37--  http://arxiv.org/pdf/2103.15348
Reusing existing connection to arxiv.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 4686220 (4.5M) [application/pdf]
Saving to: ‘layoutparser_paper.pdf’


2025-06-28 16:16:37 (10.2 MB/s) - ‘layoutparser_paper.pdf’ saved [4686220/4686220]



# **PyPDFLoader**
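Here we load a PDF with pypdf into a list of documents, where each document contains the page content plus metadata that includes the page number. Typically, each PDF page becomes one document. Note that the `page` value in the metadata is zero-based: the first page has `page: 0` and `page_label: '1'`.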

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()



In [None]:
len(pages)

16

In [None]:
pages[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'author': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': './layoutparser_paper.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven b

In [None]:
print(pages[0].page_content)

LayoutParser : A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1(  ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1Allen Institute for AI
shannons@allenai.org
2Brown University
ruochen zhang@brown.edu
3Harvard University
{melissadell,jacob carlson }@fas.harvard.edu
4University of Washington
bcgl@cs.washington.edu
5University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing an

In [None]:
print(pages[4].page_content)

LayoutParser : A Uniﬁed Toolkit for DL-Based DIA 5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset Base Model1Large Model Notes
PubLayNet [38] F / M M Layouts of modern scientiﬁc documents
PRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports
Newspaper [17] F - Layouts of scanned US newspapers from the 20th century
TableBank [18] F F Table region on modern scientiﬁc and business document
HJDataset [31] F / M - Layouts of history Japanese documents
1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [ 13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [ 28] (F) and Mask
R-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 backbone. The platform is maintained 

# **PyMuPDFLoader**
This is the fastest of the PDF parsing options. It returns one document per page and includes detailed metadata about the PDF and its pages. It uses the pymupdf library internally.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()

In [None]:
len(pages)

16

In [None]:
pages[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': './layoutparser_paper.pdf', 'file_path': './layoutparser_paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}, page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven 

In [None]:
pages[0].metadata

{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': './layoutparser_paper.pdf',
 'file_path': './layoutparser_paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z',
 'page': 0}

In [None]:
print(pages[0].page_content)

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural language processing
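Both extractions above retain typographic ligatures (such as `ﬁ` and `ﬀ`) and words hyphenated across line breaks (`im-` / `portant`), which can hurt keyword search and tokenization downstream. The following is a minimal post-processing sketch using only the standard library; `clean_pdf_text` is a hypothetical helper, and the hyphen rule is a heuristic that may also join genuinely hyphenated compounds that happen to break at a line end.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    """Normalize ligatures and rejoin words hyphenated across line breaks."""
    # NFKC maps compatibility characters such as the "ﬁ" ligature to plain letters.
    normalized = unicodedata.normalize("NFKC", text)
    # Heuristic: a hyphen directly before a newline, between letters,
    # is treated as a line-break hyphen and removed with the newline.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", normalized)


sample = (
    "sophisticated model conﬁgurations complicate the easy reuse of im-\n"
    "portant innovations"
)
cleaned = clean_pdf_text(sample)
print(cleaned)
```

With loader output, the same function could be applied to each `page.page_content`. NFKC also rewrites other compatibility characters (for example superscript digits), so it is worth checking the result on a sample before applying it to a whole corpus.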

# **UnstructuredPDFLoader**
Unstructured.io supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF.

LangChain's UnstructuredPDFLoader integrates with Unstructured to parse PDF documents into LangChain Document objects.

Load the PDF as a single document, with no complex parsing:

In [None]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

--2025-06-28 16:27:18--  http://arxiv.org/pdf/2103.15348.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2103.15348 [following]
--2025-06-28 16:27:19--  http://arxiv.org/pdf/2103.15348
Reusing existing connection to arxiv.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 4686220 (4.5M) [application/pdf]
Saving to: ‘layoutparser_paper.pdf’


2025-06-28 16:27:19 (68.8 MB/s) - ‘layoutparser_paper.pdf’ saved [4686220/4686220]



In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('./layoutparser_paper.pdf')
data = loader.load()

# **Microsoft Office Document Loaders**
The Microsoft Office suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.

Unstructured.io provides a variety of document loaders to load MS Office documents. Check them out here.

Here we will use LangChain's UnstructuredWordDocumentLoader to load data from an MS Word document.

In [None]:
!gdown 1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-

Downloading...
From: https://drive.google.com/uc?id=1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-
To: /content/Quantum Computing.docx
  0% 0.00/11.4k [00:00<?, ?B/s]100% 11.4k/11.4k [00:00<00:00, 28.4MB/s]


In [None]:
# Load the Word document as a single Document
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader('./Quantum Computing.docx')
data = loader.load()

In [None]:
len(data)

1

In [None]:
data[0].page_content[:1000]

'The Rise of Quantum Computing: A New Era of Innovation\n\nFor decades, classical computing has driven technological advancements, but the limitations of traditional binary processing are becoming evident as the world demands more computational power. Enter quantum computing—a revolutionary approach that leverages the principles of quantum mechanics to solve complex problems at unprecedented speeds.\n\nUnderstanding Quantum Computing\n\nUnlike classical computers that process information using bits (0s and 1s), quantum computers use qubits, which can exist in multiple states simultaneously due to superposition. This unique property allows quantum systems to process vast amounts of data in parallel, making them exponentially more powerful for specific tasks.\n\nAnother key principle, entanglement, enables qubits to be interconnected, meaning the state of one qubit is dependent on another, regardless of distance. This drastically enhances processing efficiency and speed, paving the way f

In [None]:
# Load the Word document as elements, chunked by section title

loader = UnstructuredWordDocumentLoader('./Quantum Computing.docx',
                                        strategy='fast',
                                        chunking_strategy="by_title",
                                        max_characters=3000, # max limit of a document chunk
                                        new_after_n_chars=2500, # preferred document chunk size
                                        mode='elements')
data = loader.load()

In [None]:
len(data)

2

In [None]:
data[0]

Document(metadata={'source': './Quantum Computing.docx', 'emphasized_text_contents': ['Understanding Quantum Computing', 'qubits', 'superposition', 'entanglement', 'Applications Transforming Industries', 'Drug Discovery & Healthcare', 'Financial Modeling', 'Cybersecurity & Cryptography', 'post-quantum cryptography', 'Climate Modeling & Sustainability', 'AI & Machine Learning'], 'emphasized_text_tags': ['b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'], 'file_directory': '.', 'filename': 'Quantum Computing.docx', 'languages': ['eng'], 'last_modified': '2025-02-19T14:45:48', 'orig_elements': 'eJzVWE1v3EYS/SsNHRZZYChwOOR8+OY4ya6AOMDuak9xIBS7izMNkd10sznyONj/vq+aM9LYcmwF0CE6SdMf1V1Vr1695q+/X3DLHbt4Y83FK3VRa13mOqesIFpl5WKhs9qUdVZTs1nqqiiaankxUxcdRzIUCXt+v9AUeevD4cZwH3cYyrGisS3fGBtYR0yJ7cuL47CjjmXgXyO5OHbqje/6MVq3vTRef5BVLbntSFsesOzXC3bbi9/S6BBvOm9sYzndtsiLKsuLbL65npevyupVub74HxZG/hBl/nrH6t92YOUb9eiwV+q1+oXv1I+BZP7KOb+naL2TC8RDn654bWPLYvPzQK02tKnyjc640joraWmyNS8p07XJFxtd6Y2mlxOon3xQhjUZH
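
The `max_characters` and `new_after_n_chars` arguments above act as a hard and a soft limit on chunk size. The sketch below is a simplified, pure-Python illustration of that idea only; it is not the algorithm `unstructured` runs internally, and the `pack_sections` helper and its `sep` separator are hypothetical names introduced for this example.

```python
def pack_sections(sections, max_characters=3000, new_after_n_chars=2500, sep="\n\n"):
    """Greedily pack text sections into chunks using a soft and a hard size limit.

    Assumption: this is an illustrative stand-in, not unstructured's chunker.
    """
    chunks = []
    current = ""
    for section in sections:
        if not section:
            continue
        # Split any section longer than the hard limit into hard-limit-sized pieces.
        pieces = [section[i:i + max_characters]
                  for i in range(0, len(section), max_characters)]
        for piece in pieces:
            candidate = piece if not current else current + sep + piece
            too_long = len(candidate) > max_characters          # hard limit
            past_soft_limit = len(current) >= new_after_n_chars  # soft limit
            if current and (too_long or past_soft_limit):
                # Close the current chunk and start a new one with this piece.
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


sections = ["a" * 1000, "b" * 1000, "c" * 1000, "d" * 500]
chunks = pack_sections(sections)
print([len(chunk) for chunk in chunks])
```

A chunk is closed either because adding the next section would exceed the hard limit, or because the chunk has already reached the preferred size.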

### Directory Loaders

LangChain's [`DirectoryLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects.

In [None]:
!wget -O 'Vision Transformers.pdf' 'https://arxiv.org/pdf/2010.11929.pdf'

--2025-06-28 16:29:43--  https://arxiv.org/pdf/2010.11929.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.195.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2010.11929 [following]
--2025-06-28 16:29:43--  http://arxiv.org/pdf/2010.11929
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3743814 (3.6M) [application/pdf]
Saving to: ‘Vision Transformers.pdf’


2025-06-28 16:29:43 (60.8 MB/s) - ‘Vision Transformers.pdf’ saved [3743814/3743814]



We first define which loader LangChain should use for each file type. The mapping follows this format:

loaders = {
  'file_format_extension' : (LoaderClass, LoaderKeywordArguments)
}

Where:

- `file_format_extension` can be any file extension, such as `.docx` or `.pdf`.
- `LoaderClass` is a specific data loader, such as `PyMuPDFLoader`.
- `LoaderKeywordArguments` are any keyword arguments that need to be passed to that loader at runtime.
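
To make the mapping format concrete, here is a minimal sketch of how such a dictionary can be consumed. It deliberately uses stand-in classes rather than real LangChain loaders so that it runs without any files or extra packages; `StubPdfLoader`, `StubDocxLoader`, `stub_loaders`, and `build_loader` are all hypothetical names for illustration.

```python
from pathlib import Path


class StubPdfLoader:
    """Hypothetical stand-in for a PDF loader class."""

    def __init__(self, file_path):
        self.file_path = file_path

    def load(self):
        return [f"pdf content of {self.file_path}"]


class StubDocxLoader:
    """Hypothetical stand-in for a Word loader that accepts a keyword argument."""

    def __init__(self, file_path, mode="single"):
        self.file_path = file_path
        self.mode = mode

    def load(self):
        return [f"docx content of {self.file_path} (mode={self.mode})"]


# Same shape as the format above: extension -> (loader class, keyword arguments)
stub_loaders = {
    ".pdf": (StubPdfLoader, {}),
    ".docx": (StubDocxLoader, {"mode": "elements"}),
}


def build_loader(file_path, registry):
    """Pick the loader class by file extension and pass its keyword arguments."""
    extension = Path(file_path).suffix.lower()
    loader_cls, loader_kwargs = registry[extension]
    return loader_cls(file_path, **loader_kwargs)


loader = build_loader("report.docx", stub_loaders)
print(loader.load())
```

Keeping the class and its keyword arguments together in one tuple means a new file type can be supported by adding a single dictionary entry.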

In [None]:
# Define a dictionary that maps file extensions to their loader class and keyword arguments
from langchain_community.document_loaders import PyMuPDFLoader, UnstructuredWordDocumentLoader

loaders = {
    '.pdf': (PyMuPDFLoader, {}),
    '.docx': (UnstructuredWordDocumentLoader, {
        'strategy': 'fast',
        'chunking_strategy': 'by_title',
        'max_characters': 3000,     # max limit of a document chunk
        'new_after_n_chars': 2500,  # preferred document chunk size
        'mode': 'elements',
    }),
}

`DirectoryLoader` accepts a `loader_cls` argument, which defaults to `UnstructuredFileLoader`. We can instead pass the loaders we defined above through `loader_cls`, and any keyword arguments for that loader through `loader_kwargs`.

We can also show a progress bar by setting `show_progress=True`.

We can use the `glob` parameter to control which files to load based on file patterns.
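
As a quick way to preview which files a pattern would select, the sketch below applies the same kind of pattern with the standard-library `pathlib` module on a temporary directory. It assumes the pattern follows `pathlib` glob semantics; the `match_files` helper and the file names are hypothetical and exist only for this example.

```python
import tempfile
from pathlib import Path


def match_files(root, pattern):
    """Return sorted, POSIX-style relative paths of files under root that match pattern."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob(pattern)
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Create a few empty files, including one in a subdirectory.
    for name in ["a.pdf", "notes.docx", "sub/b.pdf"]:
        (root / name).touch()

    pdf_matches = match_files(root, "**/*.pdf")
    docx_matches = match_files(root, "**/*.docx")

print(pdf_matches)
print(docx_matches)
```

With `pathlib`, a leading `**/` matches the directory itself as well as its subdirectories, so a pattern such as `**/*.pdf` can pick up more files than the one just downloaded.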

Here we create two separate loaders, one for Word documents and one for PDFs.

In [None]:
from langchain_community.document_loaders import DirectoryLoader

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type][0],
        loader_kwargs=loaders[file_type][1],
        show_progress=True
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', './')
docx_loader = create_directory_loader('.docx', './')

# Load the files
pdf_documents = pdf_loader.load()
docx_documents = docx_loader.load()

100%|██████████| 2/2 [00:00<00:00,  8.15it/s]
100%|██████████| 1/1 [00:00<00:00, 11.12it/s]


In [None]:
len(pdf_documents)

38

In [None]:
pdf_documents[18]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-04T00:19:58+00:00', 'source': 'Vision Transformers.pdf', 'file_path': 'Vision Transformers.pdf', 'total_pages': 22, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-04T00:19:58+00:00', 'trapped': '', 'modDate': 'D:20210604001958Z', 'creationDate': 'D:20210604001958Z', 'page': 2}, page_content='Published as a conference paper at ICLR 2021\nTransformer Encoder\nMLP \nHead\nVision Transformer (ViT)\n*\nLinear Projection of Flattened Patches\n* Extra learnable\n     [ cl ass]  embedding\n1\n2\n3\n4\n5\n6\n7\n8\n9\n0\nPatch + Position \nEmbedding\nClass\nBird\nBall\nCar\n...\nEmbedded \nPatches\nMulti-Head \nAttention\nNorm\nMLP\nNorm\n+\nL x\n+\nTransformer Encoder\nFigure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,\nadd position embeddings, and feed the resulting sequence of vectors to a 

In [None]:
len(docx_documents)

2
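
The PDF loader returned 38 documents from 2 matched files, while the metadata shown for `Vision Transformers.pdf` reports `total_pages` of 22, which suggests the remaining documents come from the other PDF in the directory. One way to check such a breakdown is to count documents per `source`. The sketch below assumes each document exposes a `metadata` dictionary with a `source` key, as in the outputs above; `count_by_source` is a hypothetical helper, and the sample objects are stand-ins rather than real LangChain `Document` instances.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_source(documents):
    """Count how many documents came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)


# Stand-in objects with the same attribute names as the Document outputs above.
sample_documents = [
    SimpleNamespace(metadata={"source": "first.pdf", "page": 0}, page_content="..."),
    SimpleNamespace(metadata={"source": "first.pdf", "page": 1}, page_content="..."),
    SimpleNamespace(metadata={"source": "second.pdf", "page": 0}, page_content="..."),
    SimpleNamespace(metadata={}, page_content="..."),
]

counts = count_by_source(sample_documents)
print(counts)
```

In the notebook, the same helper could be called as `count_by_source(pdf_documents)` or `count_by_source(docx_documents)`.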