# Unstructured File

This notebook is a copy of the [Unstructured File notebook](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) that covers how to use the `Unstructured` package to load files of many types.\
`Unstructured` currently supports loading text files, *PowerPoint presentations*, *HTML*, *PDFs*, *images*, and more.

In [1]:
# import nltk
# nltk.download('punkt')

In [2]:
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")
docs = loader.load()
docs[0].page_content[:400]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constit'

## Load list of files

In [3]:
files = ["./example_data/whatsapp_chat.txt", "./example_data/layout-parser-paper.pdf"]
loader = UnstructuredFileLoader(files)
docs = loader.load()
docs[0].page_content[:400]

'\n\n\n[05.05.23\n15:48:11] James: Hi here\n\n\n[11/8/21\n9:41:32 AM] User name: Message 123\n\n\n1/23/23\n3:19 AM - User 2: Bye!\n\n\n1/23/23\n3:22_AM - User 1: And let me know if anything changes\n\n\n[1/24/21\n12:41:03 PM] ~ User name 2: Of course!\n\n\n[2023/5/4\n16:13:23] ~ User 2: See you!\n\n\n7/19/22\n11:32\u202fPM - User 1: Hello\n\n\n7/20/22\n11:32\u202fam - User 2: Goodbye\n\n\n4/20/23\n9:42\u202fam - User 3: <Media omitted>\n\n\n6/29/23\n12'

## Retain Elements

Under the hood, Unstructured creates different “elements” for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode="elements"`.

In [4]:
loader = UnstructuredFileLoader(
    "./example_data/state_of_the_union.txt", mode="elements"
)

docs = loader.load()

docs[:5]

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-03-19T10:36:35', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),
 Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-03-19T10:36:35', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),
 Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': '.
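Each element carries a `category` entry in its metadata (for example `NarrativeText` in the output above), which makes it easy to sort elements by type. The following is a minimal sketch of that idea, not part of the loader itself. `Doc` is a hypothetical stand-in for a LangChain `Document` so the snippet runs without any third-party packages, and `group_by_category` is an illustrative helper name.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(docs):
    """Map each element category to the list of texts in that category."""
    groups = defaultdict(list)
    for doc in docs:
        # Elements without a category are collected under "Unknown".
        groups[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(groups)


# Sample elements modeled on the outputs shown in this notebook.
sample = [
    Doc("Last year COVID-19 kept us apart.", {"category": "NarrativeText"}),
    Doc("n u J", {"category": "Title"}),
    Doc("1 2 0 2", {"category": "UncategorizedText"}),
]
grouped = group_by_category(sample)
```

The same helper should work on the `docs` list returned by `loader.load()` in `elements` mode, since it only relies on the `page_content` and `metadata` attributes.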

## Define a Partitioning Strategy


Unstructured document loaders let users pass in a `strategy` parameter that tells `unstructured` how to partition the document. Currently supported strategies are `"hi_res"` (the default) and `"fast"`. The `hi_res` strategy is more accurate but takes longer to process. The `fast` strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate `hi_res` and `fast` partitioning strategies; for those document types, the `strategy` kwarg is ignored. In some cases, the `hi_res` strategy will fall back to `fast` if a dependency is missing (e.g., a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below.

In [5]:
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    "example_data/layout-parser-paper-fast.pdf", strategy="fast", mode="elements"
)

docs = loader.load()
docs[:5]

[Document(page_content='1 2 0 2', metadata={'source': 'example_data/layout-parser-paper-fast.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': 'example_data', 'filename': 'layout-parser-paper-fast.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:46:07', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),
 Document(page_content='n u J', metadata={'source': 'example_data/layout-parser-paper-fast.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': 'example_data', 'filename': 'layout-parser-paper-fast.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:46:07', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),
 Document(page_content='1 2', metadata={
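The `coordinates` metadata in the output above lists the four corner points of each element on the page. As a rough sketch, the corner points can be reduced to a bounding box, which can then be used to spot elements that sit in the page margin (such as the rotated `1 2 0 2` text). The helper names and the 10% margin threshold are assumptions made for illustration, not part of `unstructured`.

```python
def bounding_box(points):
    """Return (x_min, y_min, width, height) for a sequence of (x, y) corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def in_left_margin(points, layout_width, margin_fraction=0.1):
    """Heuristic: True if the box lies entirely within the left margin strip.

    margin_fraction is an assumed threshold; tune it for your documents.
    """
    x_min, _, width, _ = bounding_box(points)
    return x_min + width <= layout_width * margin_fraction


# Corner points and layout width copied from the first element in the output above.
points = ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36))
x_min, y_min, width, height = bounding_box(points)
is_margin = in_left_margin(points, layout_width=612)
```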

## PDF Example

Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. The modes of operation are:

- `single`: all the text from all elements is combined into one document (default)
- `elements`: individual elements are maintained
- `paged`: only the texts from each page are combined
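To make the difference between the modes concrete, here is a simplified sketch of how a list of extracted elements could be combined under each mode. It is an illustration under stated assumptions, not the loader's actual implementation: elements are modeled as plain `(text, page_number)` pairs, the `combine` function is hypothetical, and joining with a blank line is an assumed separator.

```python
from collections import defaultdict


def combine(elements, mode="single"):
    """Combine (text, page_number) pairs into text chunks according to mode."""
    if mode == "elements":
        # Keep every element as its own chunk.
        return [text for text, _ in elements]
    if mode == "paged":
        # Merge the elements that share a page number, one chunk per page.
        pages = defaultdict(list)
        for text, page in elements:
            pages[page].append(text)
        return ["\n\n".join(pages[page]) for page in sorted(pages)]
    if mode == "single":
        # Merge everything into one chunk.
        return ["\n\n".join(text for text, _ in elements)]
    raise ValueError(f"unknown mode: {mode!r}")


elements = [("Title", 1), ("First paragraph.", 1), ("Second page text.", 2)]
single = combine(elements, "single")
paged = combine(elements, "paged")
separate = combine(elements, "elements")
```

With three elements spread over two pages, `single` yields one chunk, `paged` yields two, and `elements` yields three.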

In [6]:
loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf", mode="elements"
)
docs = loader.load()
docs[:5]

[Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:47:31', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),
 Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:47:31', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),
 Document(page_content='1 2', metadata={'source': '.

If you need to post-process the unstructured elements after extraction, you can pass a list of `str -> str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example.

In [7]:
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import clean_extra_whitespace

loader = UnstructuredFileLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="elements",
    post_processors=[clean_extra_whitespace],
)

docs = loader.load()
docs[:5]

[Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:47:31', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),
 Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 258.36), (16.34, 286.14), (36.34, 286.14), (36.34, 258.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-03-19T10:47:31', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),
 Document(page_content='1 2', metadata={'source': '.
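Since a post-processor is just a `str -> str` function, you can also write your own. The sketch below defines a hypothetical `normalize_spaces` function that replaces the narrow no-break spaces (`\u202f`) seen in the WhatsApp output earlier with regular spaces and collapses repeated spaces. It is an illustrative example, not a cleaner shipped with `unstructured`.

```python
import re


def normalize_spaces(text: str) -> str:
    """Replace Unicode space variants with plain spaces and collapse runs of them."""
    # Narrow no-break space (U+202F) and no-break space (U+00A0) become plain spaces.
    text = text.replace("\u202f", " ").replace("\u00a0", " ")
    # Collapse runs of spaces or tabs, leaving newlines untouched.
    return re.sub(r"[ \t]+", " ", text).strip()


cleaned = normalize_spaces("11:32\u202fPM  - User 1:  Hello ")

# Assumed usage, following the post_processors example above:
# loader = UnstructuredFileLoader(path, mode="elements", post_processors=[normalize_spaces])
```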

## Unstructured API

If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key. The Unstructured documentation page will have instructions on how to generate an API key once they’re available. Instructions are also available if you’d like to self-host the Unstructured API or run it locally.

In [8]:
from langchain_community.document_loaders import UnstructuredAPIFileLoader

filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]

loader = UnstructuredAPIFileLoader(
    file_path=filenames[0],
    api_key="FAKE_API_KEY",
)

docs = loader.load()
docs[0]



SDKError: API error occurred: Status 401
{"detail":"API key is malformed, please type the API key correctly in the header."}
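The 401 error above is returned because `"FAKE_API_KEY"` is only a placeholder. One common way to supply a real key without hardcoding it in the notebook is to read it from an environment variable. The sketch below assumes a variable named `UNSTRUCTURED_API_KEY`; that name and the `get_api_key` helper are illustrative choices, not something the loader requires.

```python
import os


def get_api_key(env=os.environ, name="UNSTRUCTURED_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before calling the API.")
    return key


# Assumed usage with the loader shown above:
# loader = UnstructuredAPIFileLoader(file_path=filenames[0], api_key=get_api_key())
```

Failing early with a descriptive message tends to be easier to debug than an HTTP 401 from the API.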