## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

In [1]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q unstructured["local-inference"]==0.5.2 layoutparser
# Optionally upgrade to the latest versions instead (not tested with this notebook)
# %pip install -q --upgrade unstructured layoutparser
%pip install -q "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

Selecting previously unselected package poppler-utils.
(Reading database ... 122349 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.86.1-0ubuntu1.1_amd64.deb ...
Unpacking poppler-utils (0.86.1-0ubuntu1.1) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2build2_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2build2) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Setting up poppler-utils (0.86.1-0ubuntu1.1) ...
Setting up tesseract-ocr (4.1.1-2b

See our [example docs page](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs) for the example documents used in this tutorial. You can also upload your own files to Colab: click “Choose Files” in the left panel, then select the file to upload.

In [2]:
!mkdir example-docs
# Download example-10k.html and layout-parser-paper-fast.pdf
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html -P example-docs
!wget  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf -P example-docs

--2023-04-05 23:47:49--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘example-docs/example-10k.html’


2023-04-05 23:47:50 (36.0 MB/s) - ‘example-docs/example-10k.html’ saved [2456707/2456707]

--2023-04-05 23:47:50--  https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper-fast.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172270 (168K) [app

In [3]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### HTML Parsing

You can parse an HTML document using the following workflow:

In [5]:
from unstructured.documents.html import HTMLDocument

doc = HTMLDocument.from_file("example-docs/example-10k.html")

# To use a document from your Google Drive instead, uncomment the following:
# from google.colab import drive
# drive.mount('/content/drive/')
# doc = HTMLDocument.from_file("drive/MyDrive/your-filename.html")

The third page of the parsed document looks like the following:

In [6]:
print(doc.pages[2])

SPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTS

This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursue, will continue and other similar terms and phrases, as well as the use of the future tense.

Actual results could differ materially from those expressed or implied in our forward-looking statements. Our future financial condition and results of operations, as well as any forward-looking

In [7]:
doc.pages[2].elements

[<unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b9190>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b91f0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7f2b9d6b91c0>]

The parser assigns each block of text an element type, such as a title or narrative text. In the run shown above, all three elements on this page were classified as `HTMLNarrativeText`.
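To make the element types easier to inspect, the elements on a page can be grouped by class name. The following is a minimal sketch, not part of the `unstructured` API: `group_by_type` is a hypothetical helper, and `Title` and `NarrativeText` here are stand-in classes defined only so the example runs without the library. It assumes that calling `str()` on an element returns its text.

```python
from collections import defaultdict


def group_by_type(elements):
    """Group element texts by the name of each element's class."""
    groups = defaultdict(list)
    for element in elements:
        # Assumption: str(element) returns the element's text.
        groups[type(element).__name__].append(str(element))
    return dict(groups)


# Hypothetical stand-in classes so the sketch runs without the library.
class Title:
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class NarrativeText(Title):
    pass


sample = [
    Title("SPECIAL NOTE"),
    NarrativeText("This report contains statements."),
    NarrativeText("Actual results could differ."),
]
grouped = group_by_type(sample)
print({name: len(texts) for name, texts in grouped.items()})
# {'Title': 1, 'NarrativeText': 2}
```

In the notebook, the same helper could be called as `group_by_type(doc.pages[2].elements)`, provided the assumption about `str()` holds for the parsed elements.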

### PDF Parsing

You can use the following workflow to parse PDF documents.

In [None]:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf")

The following are the counts for the types of elements present in the document:

In [None]:
from collections import Counter

display(Counter(type(element) for element in elements))

Let's display the type and text of five elements from the document (indices 10 through 14):

In [None]:
display(*[(type(element), element.text) for element in elements[10:15]])

In [12]:
from unstructured.partition.pdf import partition_pdf

test_elements = partition_pdf("/content/test_pdf.pdf")

In [16]:
str(test_elements[0])

'Building applications with LLMs through composability'

In [17]:
len(test_elements)

24

In [None]:
[str(el) for el in test_elements]

In [None]:
!pip install langchain

In [24]:
from langchain.document_loaders import UnstructuredPDFLoader

test_lc = UnstructuredPDFLoader('/content/test_pdf.pdf', mode='elements')

In [27]:
len(test_lc.load())

24

In [29]:
lc_split = test_lc.load_and_split()

In [None]:
lc_split

In [None]:
!pip install pypdf

In [35]:
from langchain.document_loaders import PyPDFLoader

pypdf_test = PyPDFLoader('/content/test_pdf.pdf')

In [37]:
len(pypdf_test.load())

2

In [39]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

test_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)
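To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch using fixed-size character windows. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators and merges pieces, so its chunks usually differ); `chunk_text` is a hypothetical helper that only illustrates what a size limit of 100 characters with 20 characters of overlap means.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 250-character string of repeating digits as sample input.
text = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(text)
print([len(c) for c in chunks])
# [100, 100, 90]
```

The last 20 characters of each chunk are repeated at the start of the next one, which is the role `chunk_overlap` plays: context that straddles a chunk boundary appears in both neighbors.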

In [38]:
len(pypdf_test.load_and_split())

2

In [40]:
len(pypdf_test.load_and_split(text_splitter=test_splitter))

38

In [41]:
pypdf_split = pypdf_test.load_and_split(text_splitter=test_splitter)

In [42]:
pypdf_split[0]

Document(page_content='L a n g C h a i n\nBuilding\napplications\nwith\nLLMs\nthrough\ncomposability\nProduction\nSupport:\nAs\nyou', metadata={'source': '/content/test_pdf.pdf', 'page': 0})
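The `page_content` above has one word per line. The original notebook does not clean this up, but one possible post-processing step is to collapse all runs of whitespace into single spaces. The sketch below is an illustration only; `normalize_whitespace` is a hypothetical helper, and the sample string is taken from the start of the output above.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


raw = "L a n g C h a i n\nBuilding\napplications\nwith\nLLMs\nthrough\ncomposability"
print(normalize_whitespace(raw))
# L a n g C h a i n Building applications with LLMs through composability
```

Note that this does not repair the letter-spaced title (`L a n g C h a i n`); it only removes the line breaks between words.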