# Unstructured File

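This notebook covers how to use the `Unstructured` package to load files of many types. `Unstructured` currently supports loading text files, slide decks, HTML, PDFs, images, and more.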

In [1]:
# Install package (the extras are quoted so the shell does not expand the brackets)
!pip install "unstructured[local-inference]"
!pip install "layoutparser[layoutmodels,tesseract]"

In [2]:
# # Install other dependencies
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # If parsing xml / html documents:
# !brew install libxml2
# !brew install libxslt

In [3]:
# import nltk
# nltk.download('punkt')

In [2]:
from langchain.document_loaders import UnstructuredFileLoader

In [5]:
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")

In [6]:
docs = loader.load()

In [7]:
docs[0].page_content[:400]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constit'

## Retain Elements

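Under the hood, `Unstructured` creates different "elements" for different chunks of text. By default these are merged together, but you can keep them separate by specifying `mode="elements"`.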

In [8]:
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt", mode="elements")

In [9]:
docs = loader.load()

In [12]:
docs[:5]

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
 Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
 Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
 Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
 Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_
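
To make the difference between the two modes concrete, here is a minimal sketch that uses plain Python strings instead of the real loader. It assumes the default mode joins element texts with a blank line, which is consistent with the `\n\n` separators visible in the output of the first example. `merge_elements` is a hypothetical helper for illustration, not part of `langchain` or `unstructured`.

```python
# Hypothetical stand-ins for the element texts that Unstructured extracts.
elements = [
    "Last year COVID-19 kept us apart. This year we are finally together again.",
    "With a duty to one another to the American people to the Constitution.",
]


def merge_elements(texts, separator="\n\n"):
    """Join element texts into one string (assumed behavior of the default mode)."""
    return separator.join(texts)


# Default mode: all elements merged into a single text.
single_doc = merge_elements(elements)

# mode="elements": one entry per element, nothing merged.
element_docs = list(elements)

print(len(element_docs))
print(single_doc)
```

Keeping elements separate is mainly useful when later steps need per-chunk metadata; merging is simpler when only the full text matters.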

## Define a Partitioning Strategy

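The `Unstructured` document loader accepts a `strategy` parameter that tells `unstructured` how to partition the document. The currently supported strategies are `"hi_res"` (the default) and `"fast"`. The hi-res strategy is more accurate but takes longer to process; the fast strategy partitions the document more quickly at the cost of accuracy. Not all document types have separate hi-res and fast partitioning strategies; for those types, the `strategy` keyword argument is ignored. In some cases, the hi-res strategy falls back to fast if a dependency is missing (for example, a model for document partitioning). The cells below show how to apply a strategy to `UnstructuredFileLoader`.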
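
The fallback rule described above can be pictured with a small sketch. This is not how `unstructured` implements it internally. `resolve_strategy` is a hypothetical helper, and using the `layoutparser` package as the "missing dependency" check is an assumption based on the install cell at the top of this notebook.

```python
import importlib.util

SUPPORTED_STRATEGIES = ("hi_res", "fast")


def resolve_strategy(requested="hi_res", model_module="layoutparser"):
    """Return the strategy that would effectively run (hypothetical helper).

    Falls back from "hi_res" to "fast" when the given top-level module
    cannot be found in the current environment.
    """
    if requested not in SUPPORTED_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")
    # find_spec returns None when a top-level module is not installed.
    if requested == "hi_res" and importlib.util.find_spec(model_module) is None:
        return "fast"
    return requested


# The result depends on what is installed in your environment.
print(resolve_strategy("hi_res"))
```

A check like this can be handy in a notebook to see up front whether a slow hi-res run is likely to happen at all.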

In [1]:
from langchain.document_loaders import UnstructuredFileLoader

In [2]:
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")

In [3]:
docs = loader.load()

In [4]:
docs[:5]

[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
 Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category':
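
The fast strategy output above contains several single-character `UncategorizedText` elements. One way to deal with them is to filter on the `category` metadata key. The sketch below uses plain dictionaries as stand-ins for `Document` objects so that it runs without any dependencies. The `Title` entry is made up for illustration, and `drop_categories` is a hypothetical helper.

```python
# Plain dicts stand in for Document objects; the first two mirror the output above,
# the third is an invented example of a more useful element.
docs = [
    {"page_content": "1", "metadata": {"page_number": 1, "category": "UncategorizedText"}},
    {"page_content": "2", "metadata": {"page_number": 1, "category": "UncategorizedText"}},
    {"page_content": "An example title", "metadata": {"page_number": 1, "category": "Title"}},
]


def drop_categories(documents, unwanted=("UncategorizedText",)):
    """Keep only documents whose category is not in `unwanted`."""
    return [d for d in documents if d["metadata"].get("category") not in unwanted]


kept = drop_categories(docs)
print([d["page_content"] for d in kept])
```

With real `Document` objects the same idea applies, reading `doc.metadata.get("category")` instead of the dictionary lookup.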

## PDF Example

PDF documents are processed in exactly the same way as other file types. `Unstructured` detects the file type and extracts the same kinds of "elements".

In [1]:
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "../../"

In [7]:
loader = UnstructuredFileLoader("./example_data/layout-parser-paper.pdf", mode="elements")

In [None]:
docs = loader.load()

In [1]:
docs[:5]

[Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Zejiang Shen 1 ( (ea)\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
 Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]

## Unstructured API

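If you want to get up and running with less setup, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. These process your document using the hosted Unstructured API. Note that currently (as of 11 May 2023) the Unstructured API is open, but it will soon require an API key. Once API keys are available, the [Unstructured documentation](https://unstructured-io.github.io/) page will have instructions on how to generate one. If you would like to self-host the Unstructured API or run it locally, see the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image).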

In [1]:
from langchain.document_loaders import UnstructuredAPIFileLoader

In [2]:
filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]

In [3]:
loader = UnstructuredAPIFileLoader(
    file_path=filenames[0],
    api_key="FAKE_API_KEY",
)

In [4]:
docs = loader.load()
docs[0]

Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})

You can also batch multiple files through the Unstructured API in a single API call using `UnstructuredAPIFileLoader`.

In [5]:
loader = UnstructuredAPIFileLoader(
    file_path=filenames,
    api_key="FAKE_API_KEY",
)

In [6]:
docs = loader.load()
docs[0]

Document(page_content='Lorem ipsum dolor sit amet.\n\nThis is a test email to use for unit tests.\n\nImportant points:\n\nRoses are red\n\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})
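
The examples above hard-code a placeholder key. Once real keys are required, it is usually safer to read the key from the environment. The sketch below only builds the keyword arguments, so it runs without `langchain` or network access. The variable name `UNSTRUCTURED_API_KEY` and the helper `api_loader_kwargs` are assumptions for illustration, not names defined by either library.

```python
import os


def api_loader_kwargs(file_path, env_var="UNSTRUCTURED_API_KEY"):
    """Build keyword arguments for UnstructuredAPIFileLoader (hypothetical helper).

    `file_path` may be a single path or a list of paths, as in the cells above.
    An unset environment variable yields an empty key.
    """
    return {"file_path": file_path, "api_key": os.environ.get(env_var, "")}


kwargs = api_loader_kwargs(["example_data/fake.docx", "example_data/fake-email.eml"])
print(sorted(kwargs))

# Usage with the loader shown above (requires langchain and network access):
# loader = UnstructuredAPIFileLoader(**api_loader_kwargs(filenames))
# docs = loader.load()
```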