# Unstructured File

This notebook covers how to use the `unstructured` package to load files of many types. Unstructured currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more.

Please see [this guide](/docs/integrations/providers/unstructured/) for more instructions on setting up Unstructured locally, including setting up required system dependencies.

In [1]:
# # Install package
%pip install --upgrade --quiet "unstructured[all-docs]"


Note: you may need to restart the kernel to use updated packages.


In [2]:
# # Install other dependencies
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # If parsing xml / html documents:
# !brew install libxml2
# !brew install libxslt

In [3]:
# import nltk
# nltk.download('punkt')

In [3]:
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./example_data/whatsapp_chat.txt")

docs = loader.load()

docs[0].page_content[:400]

'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\n\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\n\n1/23/23, 2:59 AM - User 1: How much do you want?\n\n1/23/23, 3:00 AM - User 2: Online is at least $100\n\n1/23/23, 3:01 AM - User 2: Here is $129\n\n1/23/23, 3:01 AM - User 2: <Media omitted>\n\n1/23/23, 3:01 AM - User 1: Im not int'

### Load list of files

In [4]:
files = ["./example_data/whatsapp_chat.txt", "./example_data/layout-parser-paper.pdf"]

In [None]:
loader = UnstructuredFileLoader(files, mode="elements")

In [None]:
docs = loader.load()

In [None]:
print(docs[0].metadata.get("filename"), ": ", docs[0].page_content[:100])
print(docs[-1].metadata.get("filename"), ": ", docs[-1].page_content[:100])

whatsapp_chat.txt :  1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are in
layout-parser-paper.pdf :  layout analysis.


## Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default, the loader combines them into a single document, but you can keep them separate by specifying `mode="elements"`.

In [5]:
loader = UnstructuredFileLoader("./example_data/whatsapp_chat.txt", mode="elements")

docs = loader.load()

docs[:5]

[Document(page_content='1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': './example_data/whatsapp_chat.txt', 'file_directory': './example_data', 'filename': 'whatsapp_chat.txt', 'last_modified': '2024-02-27T15:49:27', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText', 'element_id': '71e4a78b931ddca6fb1ee078314026ef'}),
 Document(page_content='1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.', metadata={'source': './example_data/whatsapp_chat.txt', 'file_directory': './example_data', 'filename': 'whatsapp_chat.txt', 'last_modified': '2024-02-27T15:49:27', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText', 'element_id': '30df3b2b0b05f31a3572c7191da50b1d'}),
 Document(page_content='1/23/23, 2:59 AM - User 1: How much do you want?', metadata={'source': './example_data/whatsapp_chat.txt', 'file_directory': './example_data', 'filename': 'whatsa
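
Because each element carries a `category` in its metadata (as in the output above), the separated documents can be counted or filtered by type. The following is a minimal sketch, not part of the loader API: `Doc` is a hypothetical stand-in for a LangChain `Document`, the helper names are made up for illustration, and the sample data replaces a real `loader.load()` call so the snippet runs without `unstructured` installed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def count_categories(docs):
    """Count how many documents fall under each element category."""
    return Counter(d.metadata.get("category", "Unknown") for d in docs)


def filter_by_category(docs, category):
    """Keep only the documents whose metadata category matches."""
    return [d for d in docs if d.metadata.get("category") == category]


# Sample data; with the real loader this would be the result of loader.load().
docs = [
    Doc("1/22/23, 6:30 PM - User 1: Hi!", {"category": "NarrativeText"}),
    Doc("Chat export", {"category": "Title"}),
    Doc("1/22/23, 8:24 PM - User 2: Goodmorning!", {"category": "NarrativeText"}),
]

category_counts = count_categories(docs)
narrative = filter_by_category(docs, "NarrativeText")
print(category_counts)
print([d.page_content for d in narrative])
```

The same helpers should work on real `Document` objects, since they only rely on the `page_content` and `metadata` attributes.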

## Define a Partitioning Strategy

The Unstructured document loaders accept a `strategy` parameter that tells `unstructured` how to partition the document. Currently supported strategies are `"hi_res"` (the default) and `"fast"`. The `"hi_res"` strategy is more accurate but takes longer to process; the `"fast"` strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate hi-res and fast partitioning strategies; for those document types, the `strategy` kwarg is ignored. In some cases, the hi-res strategy will fall back to fast if a dependency is missing (e.g. a model for document partitioning). The example below applies a strategy to an `UnstructuredFileLoader`.

In [9]:
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf", strategy="hi_res", mode="elements"
)

docs = loader.load()

docs[5:10]

[Document(page_content='1 2 0 2 n u J 1 2 ] V C . s c [', metadata={'source': './example_data/layout-parser-paper.pdf', 'detection_class_prob': 0.5646073818206787, 'coordinates': {'points': ((45.388888888888886, 554.42724609375), (45.388888888888886, 1066.3888888888887), (100.94444444444446, 1066.3888888888887), (100.94444444444446, 554.42724609375)), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'last_modified': '2024-02-27T15:49:27', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'category': 'Header', 'element_id': '0c66db646ac6a0711c37657470c07f81'}),
 Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((45.388888888888886, 1094.1666666666665), (45.388888888888886, 1555.5555555555554), (100.94444444444446, 1555.5555555555554), (100.94444444444446, 1094.1666666666665

## PDF Example

Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. The modes of operation are:
- `single`: all the text from all elements is combined into one document (default)
- `elements`: individual elements are maintained
- `paged`: the text from each page is combined into one document per page
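
To make the difference between the three modes concrete, here is a small illustration in plain Python. It is only a sketch of the idea, not the loader's actual implementation: `combine_elements` is a hypothetical helper, elements are modeled as `(text, page_number)` tuples, and the `"\n\n"` separator is an assumption.

```python
from collections import defaultdict


def combine_elements(elements, mode="single"):
    """Combine (text, page_number) pairs according to the chosen mode.

    Illustration only: this is not the loader's actual implementation.
    """
    if mode == "elements":
        # Keep every element as its own text.
        return [text for text, _ in elements]
    if mode == "paged":
        # Group texts by page number, then join the texts within each page.
        pages = defaultdict(list)
        for text, page_number in elements:
            pages[page_number].append(text)
        return ["\n\n".join(pages[p]) for p in sorted(pages)]
    if mode == "single":
        # Join everything into one text.
        return ["\n\n".join(text for text, _ in elements)]
    raise ValueError(f"unknown mode: {mode!r}")


elements = [("Title", 1), ("First paragraph.", 1), ("Second page text.", 2)]

single = combine_elements(elements, "single")
paged = combine_elements(elements, "paged")
separate = combine_elements(elements, "elements")
print(len(single), len(paged), len(separate))
```

With this sample input, `single` yields one text, `paged` yields one text per page, and `elements` yields one text per element.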

In [12]:
loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf", mode="elements"
)

docs = loader.load()

docs[5:10]

[Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}),
 Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'ele

If you need to post-process the `unstructured` elements after extraction, you can pass a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example.

In [14]:
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()

docs[5:10]

[Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}),
 Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'ele
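
Since a post-processor is just a `str` -> `str` function, you can also write your own. The sketch below defines two such functions and chains them on a sample string. The function names are hypothetical, and `apply_post_processors` only stands in for the loader running each function in turn (the ordering is an assumption); with the real loader you would pass the list as `post_processors=[collapse_whitespace, normalize_quotes]`.

```python
import re

# Map curly quotes to their straight ASCII equivalents.
_QUOTE_TABLE = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def collapse_whitespace(text: str) -> str:
    """Replace any run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight quotes."""
    return text.translate(_QUOTE_TABLE)


def apply_post_processors(text: str, post_processors) -> str:
    """Hypothetical helper: run each str -> str function in order."""
    for fn in post_processors:
        text = fn(text)
    return text


raw = "  \u201cLayout\u201d   analysis.\n"
cleaned = apply_post_processors(raw, [collapse_whitespace, normalize_quotes])
print(repr(cleaned))
```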

## Unstructured API

If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. These loaders process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/unstructured/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally.

In [1]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader

filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]

loader = UnstructuredAPIFileLoader(
    file_path=filenames[0],
    api_key="FAKE_API_KEY",
)

docs = loader.load()
docs[0]

INFO: Partitioning without split.
INFO: Successfully partitioned the document.


Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx', 'metadata': {'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'}, 'category': None, 'element_id': '56d531394823d81787d77a04462ed096'})

You can also batch multiple files through the Unstructured API in a single API call using `UnstructuredAPIFileLoader`.

In [5]:
loader = UnstructuredAPIFileLoader(
    file_path=filenames,
    api_key="FAKE_API_KEY",
)

docs = loader.load()
print(docs[0].metadata["metadata"]["filename"], ": ", docs[0].page_content[:100])
print(docs[-1].metadata["metadata"]["filename"], ": ", docs[-1].page_content[:100])

INFO: Partitioning without split.
INFO: Successfully partitioned the document.
INFO: Partitioning without split.
INFO: Successfully partitioned the document.


fake.docx :  Lorem ipsum dolor sit amet.
fake-email.eml :  Violets are blue
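
The examples above pass a placeholder key inline. In practice you may prefer to read the key from the environment instead of hardcoding it. The following is a minimal sketch under stated assumptions: the `get_api_key` helper and the `UNSTRUCTURED_API_KEY` variable name are hypothetical choices for illustration, not something the loader requires.

```python
import os


def get_api_key(env_var="UNSTRUCTURED_API_KEY"):
    """Read the API key from the environment; raise a clear error if missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Demo only: set a placeholder so the sketch runs end to end.
os.environ.setdefault("UNSTRUCTURED_API_KEY", "FAKE_API_KEY")
api_key = get_api_key()

# The key could then be passed to the loader, for example:
# UnstructuredAPIFileLoader(file_path=..., api_key=api_key)
```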
