# Exploring Document Loaders in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [None]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0

In [None]:
# takes 2 - 5 mins to install on Colab
!pip install "unstructured[all-docs]==0.14.0"

After installing `unstructured` above, remember to restart your session when the following popup appears. If it does not appear, go to `Runtime` and select `Restart Session`.

![](https://i.imgur.com/UOBaotk.png)

In [None]:
# install OCR dependencies for unstructured
!sudo apt-get install tesseract-ocr
!sudo apt-get install poppler-utils

In [None]:
!pip install jq==1.7.0
!pip install pypdf==4.2.0
!pip install pymupdf==1.24.4

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
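The difference between `load` and lazy loading can be sketched in plain Python. The `SimpleDocument` and `LineLoader` classes below are hypothetical stand-ins written for this illustration, not LangChain classes; they only mirror the general pattern of an eager `load` built on top of a generator-based `lazy_load`.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that emits one document per non-empty line."""

    def __init__(self, text: str, source: str = "in-memory"):
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield documents one at a time instead of building the full list
        for line_no, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield SimpleDocument(line, {"source": self.source, "line": line_no})

    def load(self) -> list:
        # Eager loading simply exhausts the lazy generator
        return list(self.lazy_load())


loader = LineLoader("first line\n\nsecond line")
docs = loader.load()              # all documents in memory at once
first = next(loader.lazy_load())  # only the first document is produced
```

With a large source, iterating over `lazy_load()` keeps only one document in memory at a time, while `load()` materializes the whole list.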


### Text Loader

The simplest loader reads in a file as text and places it all into one document.



In [None]:
!curl -o README.md https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./README.md")
doc = loader.load()

In [None]:
len(doc)

In [None]:
type(doc[0])

In [None]:
print(doc[0].page_content[:10000])

### Markdown Loader

Markdown is a lightweight markup language for creating formatted text using a plain-text editor.

This section shows how to load Markdown documents into the LangChain `Document` format that we can use in our pipelines and chains.

Load the whole document

In [None]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode='single')
docs = loader.load()

In [None]:
len(docs)

In [None]:
type(docs[0])

In [None]:
print(docs[0].page_content[:10000])

Load document and separate based on elements

In [None]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("./README.md", mode="elements")
docs = loader.load()

In [None]:
len(docs)

In [None]:
docs[:10]

In [None]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

Comparing Unstructured.io loaders vs LangChain wrapper API

In [None]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="./README.md")

In [None]:
len(docs)

In [None]:
docs[:10]

In [None]:
docs[0].to_dict()

In [None]:
docs[1].to_dict()

In [None]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in docs]
lc_docs[:10]

### CSV Loader

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

LangChain implements a CSV Loader that will load CSV files into a sequence of `Document` objects. Each row of the CSV file is converted to one document.
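As a rough illustration of the row-to-document idea, the sketch below uses only Python's standard library. It assumes each row is rendered as `column: value` lines and that the row index is kept in the metadata; the dictionaries here are simplified stand-ins for `Document` objects, and the sample CSV text is made up for the example.

```python
import csv
import io

# Small in-memory CSV, used only for this illustration
csv_text = "Property_ID,City,Bedrooms\n101,Springfield,3\n102,Rivertown,2\n"

row_documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    # Render each row as "column: value" lines, one line per field
    content = "\n".join(f"{column}: {value}" for column, value in row.items())
    row_documents.append({"page_content": content, "metadata": {"row": row_number}})

print(row_documents[0]["page_content"])
```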

In [None]:
import pandas as pd

# Create a DataFrame with some dummy real estate data
data = {
    'Property_ID': [101, 102, 103, 104, 105],
    'Address': ['123 Elm St', '456 Oak St', '789 Pine St', '321 Maple St', '654 Cedar St'],
    'City': ['Springfield', 'Rivertown', 'Laketown', 'Hillside', 'Sunnyvale'],
    'State': ['CA', 'TX', 'FL', 'NY', 'CO'],
    'Zip_Code': [98765, 87654, 76543, 65432, 54321],
    'Bedrooms': [3, 2, 4, 3, 5],
    'Bathrooms': [2, 1, 3, 2, 4],
    'Listing_Price': [500000, 350000, 600000, 475000, 750000]
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="./data.csv")
docs = loader.load()

In [None]:
docs

In [None]:
docs[0]

In [None]:
print(docs[0].page_content)

`CSVLoader` accepts a `csv_args` keyword argument that customizes the arguments passed to Python's `csv.DictReader`. See the [`csv` module](https://docs.python.org/3/library/csv.html) documentation for details on which arguments are supported.

In [None]:
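loader = CSVLoader(file_path="./data.csv",
                   csv_args={
                      "delimiter": ",",
                      "quotechar": '"',
                      # With explicit fieldnames, csv.DictReader treats the
                      # file's header line as a data row, not as column names
                      "fieldnames": ["Property ID", "Address", "City", "State",
                                     "Zip Code", "Bedrooms", "Bathrooms", "Price"],
                   },
                  )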
docs = loader.load()

In [None]:
docs
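The effect of `fieldnames` can be checked directly with `csv.DictReader` from the standard library. This is a small sketch on made-up in-memory data: when `fieldnames` is supplied, the reader no longer consumes the first line as a header, so that line comes back as an ordinary row.

```python
import csv
import io

csv_text = "Property_ID,City\n101,Springfield\n102,Rivertown\n"

# Default behaviour: the first line supplies the column names
default_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Explicit fieldnames: the first line is returned as a data row
renamed_rows = list(
    csv.DictReader(io.StringIO(csv_text), fieldnames=["Property ID", "City Name"])
)

# One way to drop the old header row after renaming the columns
data_rows = renamed_rows[1:]
```

If the extra header document is unwanted, one option is to drop the first loaded document in the same way after calling `load()`.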

Unstructured.io loads the entire CSV as a single table

In [None]:
from langchain_community.document_loaders import UnstructuredCSVLoader

loader = UnstructuredCSVLoader("./data.csv")
docs = loader.load()

In [None]:
len(docs)

In [None]:
docs[0]

### JSON Loader

[JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

[JSON Lines](https://jsonlines.org/) is a file format where each line is a valid JSON value.

LangChain implements a [JSONLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.json_loader.JSONLoader.html) to convert JSON and JSONL data into LangChain `Document` objects. It uses a specified [`jq` schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It uses the `jq` Python package. Check out [this manual](https://jqlang.github.io/jq/manual/) for detailed documentation of the `jq` syntax.
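To make the JSON Lines format concrete, here is a minimal standard-library sketch that parses a made-up two-line JSONL string, one JSON value per line. `JSONLoader` exposes a `json_lines` flag for files in this format; check the API reference for your installed version before relying on it.

```python
import json

# Two JSON objects, one per line (illustrative data)
jsonl_text = (
    '{"sender_name": "User A", "content": "Hello"}\n'
    '{"sender_name": "User B", "content": "Hi"}\n'
)

# Each non-empty line is parsed independently as its own JSON value
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
contents = [record["content"] for record in records]
```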

In [None]:
import json

# Sample chat data dictionary used to demonstrate the JSON loader
data = {
    'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_meeting.jpg'},
    'is_still_participant': True,
    'joinable_mode': {'link': '', 'mode': 1},
    'magic_words': [],
    'messages': [
        {'content': 'See you soon!',
         'sender_name': 'User B',
         'timestamp_ms': 1675597571851},
        {'content': 'Thanks for the update! See you then.',
         'sender_name': 'User A',
         'timestamp_ms': 1675597435669},
        {'content': 'Actually, the green one is sold out.',
         'sender_name': 'User B',
         'timestamp_ms': 1675596277579},
        {'content': 'I was hoping to purchase the green one!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595140251},
        {'content': 'I’m really interested in the green one, not the red!',
         'sender_name': 'User A',
         'timestamp_ms': 1675595109305},
        {'content': 'Here’s the $150 for it.',
         'sender_name': 'User B',
         'timestamp_ms': 1675595068468},
        {'photos': [{'creation_timestamp': 1675595059,
                     'uri': 'image_of_the_item.jpg'}],
         'sender_name': 'User B',
         'timestamp_ms': 1675595060730},
        {'content': 'It typically sells for at least $200 online',
         'sender_name': 'User B',
         'timestamp_ms': 1675595045152},
        {'content': 'How much are you asking?',
         'sender_name': 'User A',
         'timestamp_ms': 1675594799696},
        {'content': 'Good morning! $50 is far too low.',
         'sender_name': 'User B',
         'timestamp_ms': 1675577876645},
        {'content': 'Hello! I’m interested in the item you posted. I can offer $50. Let me know if that works for you. Thanks!',
         'sender_name': 'User A',
         'timestamp_ms': 1675549022673}
    ],
    'participants': [{'name': 'User A'}, {'name': 'User B'}],
    'thread_path': 'inbox/User A and User B chat',
    'title': 'User A and User B chat'
}

# Save the modified data to a JSON file
with open('chat_data.json', 'w') as file:
    json.dump(data, file, indent=4)


To load the full data as a single document

In [None]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.',
    text_content=False)

data = loader.load()

In [None]:
len(data)

In [None]:
data

Suppose we are interested in extracting the values under the `messages` key of the JSON data

In [None]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[]',
    text_content=False)

data = loader.load()
data

Suppose we are interested in extracting the values under the `content` field within the `messages` key of the JSON data

In [None]:
loader = JSONLoader(
    file_path='./chat_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
data
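The three `jq` expressions used above can be read as ordinary Python traversals. The sketch below is only an analogy on a shortened, made-up copy of the chat data; it does not call `jq` or `JSONLoader`. Note that a message without a `content` key yields `None` here, just as `.messages[].content` yields `null` in `jq`.

```python
chat = {
    "title": "User A and User B chat",
    "messages": [
        {"content": "See you soon!", "sender_name": "User B"},
        {"photos": [{"uri": "image_of_the_item.jpg"}], "sender_name": "User B"},
        {"content": "How much are you asking?", "sender_name": "User A"},
    ],
}

# '.'                   -> the whole object, a single result
whole = [chat]

# '.messages[]'         -> one result per element of the messages array
messages = list(chat["messages"])

# '.messages[].content' -> the content field of each message (None if absent)
contents = [message.get("content") for message in chat["messages"]]
```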

### PDF Loaders

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

LangChain integrates with a host of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis. The right choice depends on your use case and is usually found through experimentation.

Here we will see how to load PDF documents into the LangChain `Document` format

We download a research paper to experiment with

If the following command fails, you can download the paper manually from http://arxiv.org/pdf/2103.15348.pdf, save it as `layoutparser_paper.pdf`, and upload it using the file upload option on the left in Colab.

In [None]:
!wget -O 'layoutparser_paper.pdf' 'http://arxiv.org/pdf/2103.15348.pdf'

#### PyPDFLoader

Here we load a PDF using `pypdf` into a list of documents, where each document contains the page content and metadata that includes the page number. Typically, each PDF page becomes one document.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()

In [None]:
len(pages)

In [None]:
pages[0]

In [None]:
print(pages[0].page_content)

In [None]:
print(pages[4].page_content)

#### PyMuPDFLoader

This is generally the fastest of the PDF parsing options shown here. It returns one document per page and includes detailed metadata about the PDF and its pages. It uses the `pymupdf` library internally.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./layoutparser_paper.pdf")
pages = loader.load()

In [None]:
len(pages)

In [None]:
pages[0]

In [None]:
pages[0].metadata

In [None]:
print(pages[0].page_content)

In [None]:
print(pages[4].page_content)

#### UnstructuredPDFLoader

[Unstructured.io](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [`UnstructuredPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

Load PDF as a single document - no complex parsing

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('./layoutparser_paper.pdf')
data = loader.load()

In [None]:
len(data)

In [None]:
print(data[0].page_content[:1000])

Load PDF with complex parsing, table detection and chunking by sections

In [None]:
# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('./layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=3800, # preferred size of chunks
                               combine_text_under_n_chars=2000, # smaller chunks < 2000 chars will be combined into a larger chunk
                               mode='elements')
data = loader.load()
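To build intuition for how `max_characters`, `new_after_n_chars`, and `combine_text_under_n_chars` interact, the sketch below implements a deliberately simplified title-based chunker in plain Python. It is not Unstructured's actual algorithm: the `chunk_by_title` function, the `(category, text)` tuples, and the exact rules are assumptions made for illustration only.

```python
def chunk_by_title(elements, max_characters=4000, new_after_n_chars=3800,
                   combine_text_under_n_chars=2000):
    """Simplified sketch: group (category, text) pairs into section chunks."""
    chunks, current, current_len = [], [], 0
    for category, text in elements:
        # A title opens a new chunk only if the current chunk is big enough;
        # smaller chunks keep absorbing the next section instead
        starts_section = (category == "Title"
                          and current_len >= combine_text_under_n_chars)
        # Hard limit: never let a chunk grow past max_characters
        too_long = current_len + len(text) > max_characters
        # Soft limit: prefer to close a chunk once it reaches this size
        past_soft_limit = current_len >= new_after_n_chars
        if current and (starts_section or too_long or past_soft_limit):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


elements = [
    ("Title", "Intro"), ("NarrativeText", "a" * 30),
    ("Title", "Methods"), ("NarrativeText", "b" * 30),
    ("Title", "Results"), ("NarrativeText", "c" * 10),
]

# Small limits so the effect is visible on toy data
per_section = chunk_by_title(elements, 100, 80, combine_text_under_n_chars=20)
merged = chunk_by_title(elements, 100, 80, combine_text_under_n_chars=50)
```

With the lower threshold every section becomes its own chunk; raising `combine_text_under_n_chars` merges the short first section into the next one.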

In [None]:
len(data)

In [None]:
[doc.metadata['category'] for doc in data]

In [None]:
data[0]

In [None]:
print(data[0].page_content)

In [None]:
data[5]

In [None]:
data[5].page_content

In [None]:
from IPython.display import HTML

HTML(data[5].metadata['text_as_html'])

Load using raw unstructured.io APIs for PDFs

In [None]:
from unstructured.partition.pdf import partition_pdf

# Get elements - takes 3-4 mins
raw_pdf_elements = partition_pdf(
    filename="./layoutparser_paper.pdf",
    strategy='hi_res',
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Prefer to start a new chunk after about 3800 chars (hard limit: 4000)
    # Combine chunks smaller than 2000 chars into a larger chunk
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path="./",
)

In [None]:
len(raw_pdf_elements)

In [None]:
raw_pdf_elements

In [None]:
raw_pdf_elements[5].to_dict()

Convert into LangChain `Document` format

In [None]:
from langchain_core.documents import Document

lc_docs = [Document(page_content=doc.text,
                    metadata=doc.metadata.to_dict())
              for doc in raw_pdf_elements]
lc_docs[5]

### Microsoft Office Document Loaders

The Microsoft Office suite of productivity software includes Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Outlook, and Microsoft OneNote. It is available for Microsoft Windows and macOS operating systems. It is also available on Android and iOS.

[Unstructured.io](https://docs.unstructured.io/open-source/introduction/overview) provides a variety of document loaders to load MS Office documents. Check them out [here](https://docs.unstructured.io/open-source/core-functionality/partitioning).

Here we will leverage LangChain's [`UnstructuredWordDocumentLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader.html) to load data from a MS Word document.

In [None]:
!gdown 1DEz13a7k4yX9yFrWaz3QJqHdfecFYRV-

Load word doc as a single document

In [None]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader('./Intel Strategy.docx')
data = loader.load()

In [None]:
len(data)

In [None]:
data[0].page_content[:1000]

Load word doc with complex parsing and section based chunks

In [None]:
loader = UnstructuredWordDocumentLoader('./Intel Strategy.docx',
                                        strategy='fast',
                                        chunking_strategy="by_title",
                                        max_characters=3000, # max limit of a document chunk
                                        new_after_n_chars=2500, # preferred document chunk size
                                        mode='elements')
data = loader.load()

In [None]:
len(data)

In [None]:
data[0]

In [None]:
data[3]

In [None]:
data[4]

### Directory Loaders

LangChain's [`DirectoryLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html) implements functionality for reading files from disk into LangChain [`Document`](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects.

In [None]:
!wget -O 'Vision Transformers.pdf' 'https://arxiv.org/pdf/2010.11929.pdf'

We first define and assign specific loaders which can be used by LangChain when processing the files for a specific file type. We follow this format

```
loaders = {
  'file_format_extension' : (LoaderClass, LoaderKeywordArguments)
}
```

Where:

- `file_format_extension` is a file extension such as `.docx` or `.pdf`
- `LoaderClass` is a specific data loader like `PyMuPDFLoader`
- `LoaderKeywordArguments` are any keyword arguments that need to be passed to that loader at runtime
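The mapping format described above can be sketched as a small dispatch function. `FakePdfLoader`, `FakeWordLoader`, `example_loaders`, and `build_loader` are hypothetical names invented for this illustration; they stand in for real loader classes so the unpacking of the `(LoaderClass, LoaderKeywordArguments)` tuple is easy to see.

```python
from pathlib import Path


class FakePdfLoader:
    """Hypothetical stand-in for a PDF loader class."""
    def __init__(self, file_path):
        self.file_path = file_path


class FakeWordLoader:
    """Hypothetical stand-in for a Word loader class with extra options."""
    def __init__(self, file_path, mode="single", max_characters=None):
        self.file_path = file_path
        self.mode = mode
        self.max_characters = max_characters


# Extension -> (loader class, keyword arguments for that class)
example_loaders = {
    ".pdf": (FakePdfLoader, {}),
    ".docx": (FakeWordLoader, {"mode": "elements", "max_characters": 3000}),
}


def build_loader(file_path):
    # Pick the entry by file extension, then pass the kwargs to the class
    loader_cls, loader_kwargs = example_loaders[Path(file_path).suffix]
    return loader_cls(file_path, **loader_kwargs)


word_loader = build_loader("report.docx")
pdf_loader_example = build_loader("paper.pdf")
```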

In [None]:
# Define a dictionary to map file extensions to their respective loaders
loaders = {
    '.pdf': (PyMuPDFLoader, {}),
    '.docx': (UnstructuredWordDocumentLoader, {'strategy': 'fast',
                                              'chunking_strategy' : 'by_title',
                                              'max_characters' : 3000, # max limit of a document chunk
                                              'new_after_n_chars' : 2500, # preferred document chunk size
                                              'mode' : 'elements'
                                              })
}

`DirectoryLoader` accepts a `loader_cls` argument, which defaults to `UnstructuredFileLoader`. We can instead pass one of the loaders defined above in the `loader_cls` argument, and any keyword arguments for that loader in the `loader_kwargs` argument.

We can also show a progress bar by setting `show_progress=True`

We can use the `glob` parameter to control which files to load based on file patterns
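To see which files a pattern such as `**/*.pdf` selects, the sketch below applies it with `pathlib` to a temporary directory. It assumes `DirectoryLoader` matches files with `pathlib`-style glob patterns; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Create empty placeholder files in the root and in a subdirectory
    for name in ["paper.pdf", "notes.docx", "nested/appendix.pdf"]:
        (root / name).touch()

    # '**/' matches the root directory itself and every subdirectory
    pdf_matches = sorted(
        p.relative_to(root).as_posix() for p in root.glob("**/*.pdf")
    )
    docx_matches = sorted(
        p.relative_to(root).as_posix() for p in root.glob("**/*.docx")
    )
```

Because the pattern is recursive, PDFs in subdirectories are picked up as well, which matters when the working directory contains files from earlier steps.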

Here we create two separate loaders to load files which are word documents and PDFs

In [None]:
from langchain_community.document_loaders import DirectoryLoader

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type][0],
        loader_kwargs=loaders[file_type][1],
        show_progress=True
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', './')
docx_loader = create_directory_loader('.docx', './')

# Load the files
pdf_documents = pdf_loader.load()
docx_documents = docx_loader.load()

In [None]:
len(pdf_documents)

In [None]:
pdf_documents[18]

In [None]:
len(docx_documents)