In [7]:
!apt-get install poppler-utils tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 0s (27.0 MB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 123635 files and directo

In [16]:
!pip install -qU \
    "unstructured[pdf]==0.15.13" \
    nltk==3.9.1

In [1]:
import nltk

nltk.__version__  # confirm you see 3.9.1, otherwise restart session

'3.9.1'

We need to download the NLTK `punkt` tokenizer; otherwise we will see:

```
Resource punkt not found. Please use the NLTK Downloader to obtain the resource
```

In [12]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [52]:
from unstructured.partition.auto import partition

article_url = "https://arxiv.org/pdf/1706.03762.pdf"
title = "Attention is All You Need"
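elements = partition(
    url=article_url,
    strategy="fast",  # extract the embedded PDF text directly instead of running OCR
    skip_infer_table_types=[],  # empty list: do not skip table inference for any file type
)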

This parses the PDF into a list of `unstructured` elements for us:

In [40]:
elements[:10]

[<unstructured.documents.elements.Text at 0x7a3648901720>,
 <unstructured.documents.elements.NarrativeText at 0x7a3648901570>,
 <unstructured.documents.elements.Title at 0x7a3648968850>,
 <unstructured.documents.elements.Text at 0x7a3648968a60>,
 <unstructured.documents.elements.NarrativeText at 0x7a3648968700>,
 <unstructured.documents.elements.Title at 0x7a3647f08d00>,
 <unstructured.documents.elements.Title at 0x7a3647f093c0>,
 <unstructured.documents.elements.Title at 0x7a3647f09fc0>,
 <unstructured.documents.elements.Title at 0x7a3647f0a410>,
 <unstructured.documents.elements.Title at 0x7a3647f0ae60>]

These elements are not perfect, however. Ideally, you would post-process them, merging or splitting elements as needed.

In [46]:
for elem in elements[:6]:
    print(elem.text)

3 2 0 2
g u A 2
] L C . s c [
7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.
Attention Is All You Need
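
The first element above is the rotated arXiv watermark, which comes out as a run of single characters. One possible post-processing step is to drop such elements with a simple heuristic. This is a minimal sketch rather than anything `unstructured` provides: the function name `looks_like_spaced_out_text`, the 0.8 threshold, and the 4-token minimum are all assumptions chosen for illustration, and the sample strings are copied from the output above.

```python
def looks_like_spaced_out_text(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when most whitespace-separated tokens are single characters."""
    tokens = text.split()
    # Very short strings (e.g. a page number) carry too little signal to judge.
    if len(tokens) < 4:
        return False
    single_chars = sum(1 for tok in tokens if len(tok) == 1)
    return single_chars / len(tokens) >= threshold


# Plain strings standing in for `elem.text` values.
sample_texts = [
    "3 2 0 2\ng u A 2\n] L C . s c [\n7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a",
    "Attention Is All You Need",
]

kept = [t for t in sample_texts if not looks_like_spaced_out_text(t)]
print(kept)
```

With real elements, the same check could be applied as `[e for e in elements if not looks_like_spaced_out_text(e.text)]`. The threshold would likely need tuning on documents that contain legitimate runs of short tokens, such as equations.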


With these elements we can begin to handle our data in a way that is context- and structure-aware. For example, we may prefix headers to chunks. Let's try this on a section where `Title` elements are followed by `NarrativeText` elements:

In [49]:
elements[30:45]

[<unstructured.documents.elements.Title at 0x7a3648986230>,
 <unstructured.documents.elements.NarrativeText at 0x7a3648985690>,
 <unstructured.documents.elements.Footer at 0x7a36489869b0>,
 <unstructured.documents.elements.Title at 0x7a3647cfeaa0>,
 <unstructured.documents.elements.NarrativeText at 0x7a3647cfea40>,
 <unstructured.documents.elements.Title at 0x7a3647cfe590>,
 <unstructured.documents.elements.NarrativeText at 0x7a364cf14400>,
 <unstructured.documents.elements.NarrativeText at 0x7a3647cffdc0>,
 <unstructured.documents.elements.Title at 0x7a3647cffd90>,
 <unstructured.documents.elements.NarrativeText at 0x7a3647cfc760>,
 <unstructured.documents.elements.Footer at 0x7a3647cfefe0>,
 <unstructured.documents.elements.Title at 0x7a36488846a0>,
 <unstructured.documents.elements.Title at 0x7a3648885600>,
 <unstructured.documents.elements.NarrativeText at 0x7a3648885900>,
 <unstructured.documents.elements.NarrativeText at 0x7a3648885d80>]

In [48]:
for elem in elements[30:45]:
    print(elem.text)

3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
2
Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise full

We will pair each `NarrativeText` block with the most recent `Title` to create more context-aware chunks:

In [55]:
from unstructured.documents.elements import Title, NarrativeText

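header = ""  # text of the most recent Title seen so far
chunks = []
for elem in elements[30:45]:
    if isinstance(elem, Title):
        # remember the latest header; it applies to the text that follows
        header = elem.text
    elif isinstance(elem, NarrativeText):
        # prefix each paragraph with the document title and its header;
        # other element types (e.g. Footer) are skipped
        chunks.append(f"Document: {title}\nHeader: {header}\nContent: {elem.text}")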

In [56]:
print(chunks[0])

Document: Attention is All You Need
Header: 3 Model Architecture
Content: Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
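
One caveat with this loop: in the element listing above, the caption "Figure 1: The Transformer - model architecture." is also tagged as a `Title`, so it would become the header for the paragraph that follows it. A possible refinement, sketched below, is to accept a `Title` as a header only when it looks like a numbered section heading. The regular expression and the helper name `is_section_header` are assumptions for illustration, and the `(type name, text)` tuples are stand-ins for real elements, with the paragraph texts shortened.

```python
import re

# Assumed pattern: a section number such as "3" or "3.1", whitespace, then a capital letter.
SECTION_HEADER = re.compile(r"^\d+(\.\d+)*\s+[A-Z]")


def is_section_header(text: str) -> bool:
    """True when the text starts like a numbered section heading."""
    return bool(SECTION_HEADER.match(text))


# (element type name, text) pairs standing in for unstructured elements.
sample = [
    ("Title", "3 Model Architecture"),
    ("NarrativeText", "Most competitive neural sequence transduction models ..."),
    ("Footer", "2"),
    ("Title", "Figure 1: The Transformer - model architecture."),
    ("NarrativeText", "The Transformer follows this overall architecture ..."),
    ("Title", "3.1 Encoder and Decoder Stacks"),
]

header = ""
pairs = []
for kind, text in sample:
    if kind == "Title" and is_section_header(text):
        header = text  # only numbered headings update the running header
    elif kind == "NarrativeText":
        pairs.append((header, text))

for section, paragraph in pairs:
    print(f"{section!r} -> {paragraph!r}")
```

Here both paragraphs stay under "3 Model Architecture" because the figure caption no longer overrides the header. Unnumbered headings such as an abstract title would be missed by this pattern, so it is only a starting point.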


Realistically, we would want to _chunk_ our content _before_ augmenting those chunks with additional information, but this shows how we can provide more context simply by considering the structure of the data.
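
A minimal sketch of that ordering, chunking first and augmenting second, could look like the following. The helpers `split_into_chunks` and `augment` and the `max_chars=120` limit are hypothetical names and values chosen for illustration. The splitter is a naive word-packing strategy, not a recommendation of a particular chunking method, and the body text is the opening of the paragraph printed earlier.

```python
def split_into_chunks(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack whole words into chunks.

    A chunk exceeds max_chars only if a single word is longer than the limit.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # the next word would overflow: close the current chunk
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def augment(chunk: str, title: str, header: str) -> str:
    """Prefix a chunk with document-level and section-level context."""
    return f"Document: {title}\nHeader: {header}\nContent: {chunk}"


title = "Attention is All You Need"
header = "3 Model Architecture"
body = (
    "Most competitive neural sequence transduction models have an "
    "encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input "
    "sequence of symbol representations (x1, ..., xn) to a sequence of "
    "continuous representations z = (z1, ..., zn)."
)

raw_chunks = split_into_chunks(body)  # step 1: chunk the raw content
augmented = [augment(c, title, header) for c in raw_chunks]  # step 2: add context

for chunk in augmented:
    print(chunk, end="\n\n")
```

Chunking first keeps the size limit tied to the content itself, so the added `Document` and `Header` lines do not eat into each chunk's budget and every chunk carries the same context prefix.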