# Unstructured File

This notebook covers how to use the `Unstructured` package to load files of many types. `Unstructured` currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more.

In [None]:
# # Install package
!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install "layoutparser[layoutmodels,tesseract]"

In [2]:
# # Install other dependencies
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # If parsing xml / html documents:
# !brew install libxml2
# !brew install libxslt

In [3]:
# import nltk
# nltk.download('punkt')

In [4]:
from langchain.document_loaders import UnstructuredFileLoader

In [5]:
loader = UnstructuredFileLoader("./example_data/whatsapp_chat.txt")

In [6]:
docs = loader.load()

In [7]:
docs[0].page_content[:400]

'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks! 1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low. 1/23/23, 2:59 AM - User 1: How much do you want? 1/23/23, 3:00 AM - User 2: Online is at least $100 1/23/23, 3:01 AM - User 2: Here is $129 1/23/23, 3:01 AM - User 2: <Media omitted> 1/23/23, 3:01 AM - User 1: Im not intereste'

## Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default the loader combines them into a single document, but you can keep them separate by specifying `mode="elements"`.

In [8]:
loader = UnstructuredFileLoader("./example_data/whatsapp_chat.txt", mode="elements")

In [9]:
docs = loader.load()

In [10]:
docs[:5]

[Document(page_content='1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks! 1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low. 1/23/23, 2:59 AM - User 1: How much do you want? 1/23/23, 3:00 AM - User 2: Online is at least $100 1/23/23, 3:01 AM - User 2: Here is $129 1/23/23, 3:01 AM - User 2: <Media omitted> 1/23/23, 3:01 AM - User 1: Im not interested in this bag. Im interested in the blue one! 1/23/23, 3:02 AM - User 1: I thought you were selling the blue one! 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale 1/23/23, 3:19 AM - User 1: Oh no worries! Bye 1/23/23, 3:19 AM - User 2: Bye! 1/23/23, 3:22_AM - User 1: And let me know if anything changes', metadata={'source': './example_data/whatsapp_chat.txt', 'filetype': 'TXT', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/whatsapp_chat.txt', 'category': 'NarrativeText
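Because each element carries a `category` entry in its metadata (as in the output above), a quick tally can show what kinds of elements a file produced. The following is a minimal sketch, not part of the loader's API: `count_by_category` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real `Document` objects so the example runs without `langchain` installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_category(docs):
    """Count documents per element category (hypothetical helper)."""
    # Documents may lack a "category" key (for example when not loaded with
    # mode="elements"), so fall back to a placeholder instead of raising.
    return Counter(doc.metadata.get("category", "Unknown") for doc in docs)


# Stand-in objects with made-up content; with the real loader you would pass
# the `docs` list returned by `loader.load()`.
sample_docs = [
    SimpleNamespace(page_content="Hi!", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Chat export", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Bye!", metadata={"category": "NarrativeText"}),
]

category_counts = count_by_category(sample_docs)
print(category_counts.most_common())
```

With real data, `count_by_category(docs)` should work unchanged, since it only relies on the `metadata` attribute shown in the output above.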

## Define a Partitioning Strategy

Unstructured document loaders accept a `strategy` parameter that tells `unstructured` how to partition the document. The currently supported strategies are `"hi_res"` (the default) and `"fast"`. The `"hi_res"` strategy is more accurate but takes longer to process; the `"fast"` strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate `"hi_res"` and `"fast"` strategies; for those document types, the `strategy` kwarg is ignored. In some cases, the `"hi_res"` strategy falls back to `"fast"` if a dependency is missing (e.g. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below.

In [1]:
from langchain.document_loaders import UnstructuredFileLoader

In [12]:
loader = UnstructuredFileLoader("./example_data/layout-parser-paper.pdf", strategy="fast", mode="elements")

In [13]:
docs = loader.load()

In [14]:
docs[:5]

[Document(page_content='1 2 0 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
 Document(page_content='n u J', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}),
 Document(page_content='1 2', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
 Document(page_content=']', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': '
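To get a feel for the speed side of the speed/accuracy trade-off on your own files, you could time each strategy. The sketch below is only an illustration: `timed_load` and `FakeLoader` are hypothetical names, and the fake loader exists solely so the example runs without `unstructured` installed. The commented-out loop shows how the same helper might be applied to the real loader.

```python
import time


def timed_load(make_loader):
    """Build a loader, call .load(), and return (docs, elapsed_seconds)."""
    loader = make_loader()
    start = time.perf_counter()
    docs = loader.load()
    return docs, time.perf_counter() - start


class FakeLoader:
    """Hypothetical stand-in for a real document loader."""

    def __init__(self, docs):
        self._docs = docs

    def load(self):
        return list(self._docs)


docs, seconds = timed_load(lambda: FakeLoader(["element 1", "element 2"]))
print(f"loaded {len(docs)} elements in {seconds:.4f}s")

# With the real loader, the same helper could compare the two strategies:
# for strategy in ("fast", "hi_res"):
#     docs, seconds = timed_load(
#         lambda: UnstructuredFileLoader(
#             "./example_data/layout-parser-paper.pdf",
#             strategy=strategy,
#             mode="elements",
#         )
#     )
#     print(strategy, len(docs), f"{seconds:.2f}s")
```

Timings depend heavily on the machine and on which optional dependencies are installed, so treat any numbers as rough guidance only.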

## PDF Example

Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of `elements`. 

In [None]:
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "./example_data/"

In [3]:
loader = UnstructuredFileLoader("./example_data/layout-parser-paper.pdf", mode="elements")

In [None]:
docs = loader.load()

In [5]:
docs[:5]

[Document(page_content='2103.15348v2 [cs.CV] 21 Jun 2021', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'UncategorizedText'}),
 Document(page_content='arXiv', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}),
 Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'filetype': 'PDF', 'file_mod_date': '2023-05-10 02:39:59', 'file_create_date': '2023-05-10 02:39:59', 'filename': './example_data/layout-parser-paper.pdf', 'page_number': 1, 'category': 'Title'}),
 Document(page
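Since PDF elements include a `page_number` in their metadata, they can be regrouped into per-page text when element-level granularity is too fine. This is a minimal sketch under stated assumptions: `text_by_page` is a hypothetical helper, and the `SimpleNamespace` objects with partly made-up content stand in for the `Document` objects returned by `loader.load()`.

```python
from collections import defaultdict
from types import SimpleNamespace


def text_by_page(docs):
    """Join element text per page, keeping the elements' original order."""
    pages = defaultdict(list)
    for doc in docs:
        # Elements without a page number are grouped under None.
        pages[doc.metadata.get("page_number")].append(doc.page_content)
    return {page: "\n".join(chunks) for page, chunks in pages.items()}


# Stand-in objects so the sketch runs without langchain installed.
sample_docs = [
    SimpleNamespace(page_content="arXiv", metadata={"page_number": 1}),
    SimpleNamespace(
        page_content="LayoutParser: A Unified Toolkit", metadata={"page_number": 1}
    ),
    SimpleNamespace(page_content="Second page text", metadata={"page_number": 2}),
]

pages = text_by_page(sample_docs)
print(pages[1])
```

Joining with a newline is an arbitrary choice here; a different separator may suit downstream text splitters better.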