https://llmsherpa.readthedocs.io/en/latest/llmsherpa.readers.html

In [81]:
from llmsherpa.readers import LayoutPDFReader
from llmsherpa.readers import LayoutReader
from llmsherpa.readers.layout_reader import ListItem, Paragraph, Table
from pprint import pprint
from IPython.display import display, HTML
from graphviz import Digraph

In [29]:
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_file = "documents/word.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc_url = pdf_reader.read_pdf(pdf_url)
doc_file = pdf_reader.read_pdf(pdf_file)

In [4]:
print(doc_url.sections()[0].to_text())

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension


In [7]:
for i in range(5):
    print(doc_url.sections()[i].to_text())

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
{mikelewis,yinhanliu,naman}@fb.com
Abstract
1 Introduction
B D A B C D E


In [12]:
for i in range(5):
    print(doc_url.chunks()[i].to_text(), end="\n\n")

Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.
BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.
It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
We evaluate a number of noising approaches, ﬁnding the best performance by both randomly shufﬂing the order of the original sentences and using a novel in-ﬁlling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when ﬁne tuned for text generation but also works well for comprehension tasks.
It matches the performance of RoBERTa wi

In [22]:
doc_file.tables()[0].to_html()

'<table><th><td colSpan=1>Title of Each Class</td><td colSpan=1>Trading Symbol(s)</td><td colSpan=1>Name of Each Exchange on Which Registered</td></th><tr><td colSpan=1>Common Stock, par value $.01 per share</td><td colSpan=1>AMZN</td><td colSpan=1>Nasdaq Global Select Market</td></tr><tr><td>Securities registered pursuant to Section 12(g) of the Act: None</td></tr><tr><td colSpan=1>Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.</td><td colSpan=1>Yes ☒</td><td colSpan=1>No ☐</td></tr><tr><td colSpan=1>Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Exchange Act.</td><td colSpan=1>Yes ☐</td><td colSpan=1>No ☒</td></tr><tr><td>Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the

In [25]:
from IPython.display import display, HTML

html_content = doc_file.tables()[0].to_html()
display(HTML(html_content))

0,1,2
"Common Stock, par value $.01 per share",AMZN,Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act: None,,
"Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.",Yes ☒,No ☐
Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Exchange Act.,Yes ☐,No ☒
"Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes ☒ No ☐",,
Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes ☒ No ☐,,
"Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.",,
Large accelerated filer ☒,,Accelerated filer ☐
Non-accelerated filer ☐,,Smaller reporting company ☐


We could split the text by page, but any passage that continues across a page break would be cut in two, so its continuity would be lost.
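To make the concern concrete, here is a minimal sketch of a naive page split. The blocks are made-up toy data shaped like the parser's JSON output (`page_idx`, `tag`, `sentences`), and `split_by_page` is a hypothetical helper, not part of llmsherpa.

```python
from collections import defaultdict

# Toy blocks shaped like the parser's JSON output; the values are made up.
blocks = [
    {"page_idx": 0, "tag": "header", "sentences": ["1 Introduction"]},
    {"page_idx": 0, "tag": "para", "sentences": ["First half of the argument."]},
    {"page_idx": 1, "tag": "para", "sentences": ["Second half of the argument."]},
]


def split_by_page(blocks):
    """Group block text by page index, ignoring the document structure."""
    pages = defaultdict(list)
    for block in blocks:
        pages[block["page_idx"]].append(" ".join(block["sentences"]))
    return dict(pages)


pages = split_by_page(blocks)
for page_idx, texts in pages.items():
    print(page_idx, texts)
```

In this toy case the chunk for page 1 holds the second half of the section without its header, which is the kind of context loss a structure-aware split avoids.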

In [31]:
print(doc_file.chunks()[58].to_text())

☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 or


In [33]:
for i in range(59, 65):
    print(doc_file.chunks()[i].to_text())

☐
For the transition period from to.
AMAZON.COM, INC.
Delaware
91-1646860 410 Terry Avenue North
Seattle, Washington 98109-5210 (206) 266-1000 Securities registered pursuant to Section 12(b) of the Act:


In [34]:
for i in range(59, 65):
    print(doc_file.sections()[i].to_text())

We Have a Rapidly Evolving Business Model and Our Stock Price Is Highly Volatile
Legal and Regulatory Risks
Government Regulation Is Evolving and Unfavorable Changes Could Harm Our Business
We Face Additional Tax Liabilities and Collection Obligations
We Are Subject to Risks Related to Government Contracts and Related Procurement Regulations
Item 1B. Unresolved Staff Comments


Using the raw JSON instead of the default PDF reader for greater customisability

In [108]:
test = LayoutReader()
block = test.read(doc_file.json)

In [109]:
pprint(doc_file.json)

[{'bbox': [90.0, 72.24, 511.3, 146.39999999999998],
  'block_class': 'cls_0',
  'block_idx': 0,
  'level': 0,
  'page_idx': 0,
  'sentences': ['[Congressional Record Volume 170, Number 41 (Thursday, March '
                '7, 2024)] [Senate] [Pages S2272-S2277] From the Congressional '
                'Record Online through the Government Publishing Office '
                '[www.gpo.gov]'],
  'tag': 'para'},
 {'bbox': [108.0, 212.16, 492.0, 317.76],
  'block_class': 'cls_0',
  'block_idx': 1,
  'level': 0,
  'page_idx': 0,
  'sentences': ['PRESIDENTIAL MESSAGE REPORT ON THE STATE OF THE UNION '
                'DELIVERED TO A JOINT SESSION OF CONGRESS ON MARCH 7, 2024--PM '
                '41'],
  'tag': 'para'},
 {'bbox': [90.0, 336.72, 498.0, 395.52],
  'block_class': 'cls_0',
  'block_idx': 2,
  'level': 1,
  'page_idx': 0,
  'sentences': ['The PRESIDING OFFICER laid before the Senate the following '
                'message from the President of the United States which was '
   

In [110]:
listItem = None
count = 0
for item in doc_file.json:
    # pretty-print the first 12 blocks tagged "para"; listItem ends up wrapping the last one printed
    if item["tag"] == "para":
        pprint(item)
        listItem = ListItem(item)
        if count > 10:
            break
        count+=1

{'bbox': [90.0, 72.24, 511.3, 146.39999999999998],
 'block_class': 'cls_0',
 'block_idx': 0,
 'level': 0,
 'page_idx': 0,
 'sentences': ['[Congressional Record Volume 170, Number 41 (Thursday, March '
               '7, 2024)] [Senate] [Pages S2272-S2277] From the Congressional '
               'Record Online through the Government Publishing Office '
               '[www.gpo.gov]'],
 'tag': 'para'}
{'bbox': [108.0, 212.16, 492.0, 317.76],
 'block_class': 'cls_0',
 'block_idx': 1,
 'level': 0,
 'page_idx': 0,
 'sentences': ['PRESIDENTIAL MESSAGE REPORT ON THE STATE OF THE UNION '
               'DELIVERED TO A JOINT SESSION OF CONGRESS ON MARCH 7, 2024--PM '
               '41'],
 'tag': 'para'}
{'bbox': [90.0, 336.72, 498.0, 395.52],
 'block_class': 'cls_0',
 'block_idx': 2,
 'level': 1,
 'page_idx': 0,
 'sentences': ['The PRESIDING OFFICER laid before the Senate the following '
               'message from the President of the United States which was '
               'ordered to lie 

In [111]:
# find all the unique tags in this document
unique_tags = set(item["tag"] for item in doc_file.json)
print(unique_tags)

{'para'}


In [112]:
HTML(listItem.to_html(True))

In [114]:
from pdf2image import convert_from_path
from PIL import ImageDraw

# Path to the PDF file
pdf_path = "documents/word.pdf"

# Page to render (1-based for pdf2image, whereas page_idx in the JSON is 0-based)
page_number = 1

# Render only that page to an image
images = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)

# Scale factor from bbox units to image pixels. 2.78 is close to 200 / 72,
# i.e. pdf2image's default 200 DPI over 72 PDF points per inch
# (assuming the bbox values are in points).
correction_multiple = 2.78
# bbox of block_idx 1 on page_idx 0, copied from the JSON output above
bbox = (108.0, 212.16, 492.0, 317.76)
bbox_corrected = tuple(coord * correction_multiple for coord in bbox)

page = images[0]

# Draw a red bounding box on the page image
draw = ImageDraw.Draw(page)
draw.rectangle(bbox_corrected, outline="red")

# Save the full page image with the bounding box drawn on it (nothing is cropped)
page.save("screenshot.png")
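The scale factor in the cell above can be derived rather than hard-coded. The sketch below assumes the `bbox` values are in PDF points (72 per inch) and that the page was rasterised at a known DPI (pdf2image defaults to 200). `points_to_pixels` is a hypothetical helper, not part of llmsherpa or pdf2image.

```python
def points_to_pixels(bbox, dpi=200):
    """Scale an (x0, y0, x1, y1) box from PDF points (72 per inch) to pixels at `dpi`."""
    scale = dpi / 72.0
    return tuple(round(coord * scale, 2) for coord in bbox)


# Same bbox as in the cell above (block_idx 1 on page_idx 0).
bbox_px = points_to_pixels((108.0, 212.16, 492.0, 317.76), dpi=200)
print(bbox_px)
```

If the page is rendered with a different `dpi` argument to `convert_from_path`, passing the same value here should keep the box aligned, provided the points assumption holds.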


In [32]:
json_file = doc_file.json

In [33]:
pprint(json_file[:2])

[{'bbox': [90.0, 72.24, 511.3, 146.39999999999998],
  'block_class': 'cls_0',
  'block_idx': 0,
  'level': 0,
  'page_idx': 0,
  'sentences': ['[Congressional Record Volume 170, Number 41 (Thursday, March '
                '7, 2024)] [Senate] [Pages S2272-S2277] From the Congressional '
                'Record Online through the Government Publishing Office '
                '[www.gpo.gov]'],
  'tag': 'para'},
 {'bbox': [108.0, 212.16, 492.0, 317.76],
  'block_class': 'cls_0',
  'block_idx': 1,
  'level': 0,
  'page_idx': 0,
  'sentences': ['PRESIDENTIAL MESSAGE REPORT ON THE STATE OF THE UNION '
                'DELIVERED TO A JOINT SESSION OF CONGRESS ON MARCH 7, 2024--PM '
                '41'],
  'tag': 'para'}]


In [71]:
# find all the unique block_class
unique_block_classes = set(item["block_class"] for item in json_file)
print(unique_block_classes)

{'cls_0'}


In [77]:
# print all levels of the json file
unique_levels = set(item["level"] for item in json_file)
print(unique_levels)

{0, 1}


In [41]:
len(json_file)

78

In [53]:
temp = LayoutReader()
layout_tree = temp.read(doc_file.json)

In [75]:
print("Text:", layout_tree.children[2].to_text())
print()
print("Context to text:", layout_tree.children[2].to_context_text())

Text: The PRESIDING OFFICER laid before the Senate the following message from the President of the United States which was ordered to lie on the table:

Context to text: 
The PRESIDING OFFICER laid before the Senate the following message from the President of the United States which was ordered to lie on the table:


In [76]:
unique_tags = set(item["tag"] for item in json_file)
print(unique_tags)

{'para'}


In [79]:
for child in layout_tree.children:
    if child.level == 0:
        print (child.to_text())


[Congressional Record Volume 170, Number 41 (Thursday, March 7, 2024)] [Senate] [Pages S2272-S2277] From the Congressional Record Online through the Government Publishing Office [www.gpo.gov]
PRESIDENTIAL MESSAGE REPORT ON THE STATE OF THE UNION DELIVERED TO A JOINT SESSION OF CONGRESS ON MARCH 7, 2024--PM 41


In [80]:
def visualize_block_tree(root, graph=None):
    if graph is None:
        graph = Digraph()
    
    node_id = str(id(root))
    graph.node(node_id, f"{root.tag} (level {root.level})")

    for child in root.children:
        child_id = str(id(child))
        graph.node(child_id, f"{child.tag} (level {child.level})")
        graph.edge(node_id, child_id)
        # reset the root node and recurse
        visualize_block_tree(child, graph)
    
    return graph

In [83]:
layout_tree = temp.read(doc_url.json)

[Useful!] Helps us visualise the layout tree

In [84]:
graph = visualize_block_tree(layout_tree)
graph.render('block_tree', view=True)

'block_tree.pdf'

In [96]:
# identify the headers at this level; there are many more headers deeper in the tree.
headers = [block for block in layout_tree.children if block.tag == "header"]
print("Number of top level headers:")
print(len(headers))
print()
print("Top level headers right after root:")
for header in headers:
    print(header.to_text())

Number of top level headers:
3

Top level headers right after root:
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
8 Conclusions
References


In [98]:
print(headers[0].to_text(include_children=True))

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI
{mikelewis,yinhanliu,naman}@fb.com


In [103]:
print(headers[0].children)

[<llmsherpa.readers.layout_reader.Paragraph object at 0x114effdd0>, <llmsherpa.readers.layout_reader.Section object at 0x114efd550>]


In [105]:
for child in headers[0].children:
    print(child.tag)
    print(child.to_text())

para
Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI
header
{mikelewis,yinhanliu,naman}@fb.com


The reader identified "BART..." as a top-level header (level 0) with two children: "Mike Lewis..." as a paragraph (level 0) and "{mikelewis ...}..." as a second-level header (level 1).

![title](data/header-demo.png)
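For a quick look at the same hierarchy without rendering a graph, an indented text outline also works. This is a minimal sketch: `Node` is a hypothetical stand-in that mimics only the `tag` and `children` attributes used in this notebook, and the texts are the abbreviated labels from the note above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a layout-tree block (tag, text, children only)."""
    tag: str
    text: str
    children: List["Node"] = field(default_factory=list)


def outline(node, depth=0, width=60):
    """Return one indented line per node, depth-first, truncating long text."""
    lines = ["  " * depth + f"[{node.tag}] {node.text[:width]}"]
    for child in node.children:
        lines.extend(outline(child, depth + 1, width))
    return lines


tree = Node("header", "BART...", [
    Node("para", "Mike Lewis..."),
    Node("header", "{mikelewis ...}...", [Node("header", "Abstract")]),
])
print("\n".join(outline(tree)))
```

With real llmsherpa blocks, the text would presumably come from `to_text()` instead of a `text` attribute.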

In [111]:
for i in range(6):
    print(headers[0].children[1].children[i].to_text())

Abstract
1 Introduction
2 Model
3 Fine-tuning BART
4 Comparing Pre-training Objectives
5 Large-scale Pre-training Experiments


As we can see from the above, the reader is able to identify the section headers of this PDF.

In [120]:
# first child of the sixth sub-header ("5 Large-scale Pre-training Experiments") under headers[0].children[1]
deepest_child = headers[0].children[1].children[5].children[0]
print("Deepest Child")
print(deepest_child.to_text())
print()
print("First Header")
print(headers[0].to_text())
# print the entire chain
print()
print("Entire chain")
print(headers[0].to_text(include_children=True, recurse=True))


Deepest Child
Recent work has shown that downstream performance can dramatically improve when pre-training is scaled to large batch sizes (Yang et al., 2019; Liu et al., 2019) and corpora.
To test how well BART performs in this regime, and to create a useful model for downstream tasks, we trained BART using the same scale as the RoBERTa model.

First Header
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Entire chain
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI
{mikelewis,yinhanliu,naman}@fb.com
Abstract
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.
BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original tex

In [130]:
print("Header 1")
print(headers[0].to_text())
print()
print("Header 1-1")
print("Type:", headers[0].children[1].tag)
print(headers[0].children[1].to_text())
print()
print("Header 1-1-1")
print("Type:", headers[0].children[1].children[0].tag)
print(headers[0].children[1].children[0].to_text())
print()
print("Header 1-1-1-1")
print("Type:", headers[0].children[1].children[0].children[0].tag)
print(headers[0].children[1].children[0].children[0].to_text())
# so headers[0].children[1].children[0].children[0] is our first deepest child

Header 1
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Header 1-1
Type: header
{mikelewis,yinhanliu,naman}@fb.com

Header 1-1-1
Type: header
Abstract

Header 1-1-1-1
Type: para
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.
BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.
It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
We evaluate a number of noising approaches, ﬁnding the best performance by both randomly shufﬂing the order of the original sentences and using a novel in-ﬁlling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when ﬁne tun

In [132]:
print(headers[0].children[1].children[0].children[0].parent_text())

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension > {mikelewis,yinhanliu,naman}@fb.com > Abstract


#### [Important!] Verdict: Instead of splitting the text top-down, chunk the document along this tree structure: add to the vector database only the content of each leaf node, prefixed with its `parent_text()` as seen in the previous code block.

In [133]:
print(headers[0].children[1].children[0].children[0].to_context_text())

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension > {mikelewis,yinhanliu,naman}@fb.com > Abstract
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.
BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.
It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
We evaluate a number of noising approaches, ﬁnding the best performance by both randomly shufﬂing the order of the original sentences and using a novel in-ﬁlling scheme, where spans of text are replaced with a single mask token.
BART is particularly effective when ﬁne tuned for text generation but also works well for comprehension tasks.
It matches the p

#### [Most important!] Final note: Call `to_context_text()` on a leaf node to get its content together with its chain of parent headers. Build chunks from leaf nodes only.
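The leaf-only chunking idea can be sketched as a depth-first walk. To keep the example self-contained, `Node` is a hypothetical stand-in for a layout-tree block, and `parent_chain` and `leaf_chunks` are illustrative helpers rather than llmsherpa functions. With real blocks, the per-leaf string would presumably come from `to_context_text()` as shown above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a layout-tree block."""
    text: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child, set its parent link, and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def parent_chain(node: Node) -> str:
    """Join the ancestors' text from the root down, separated by ' > '."""
    titles = []
    current = node.parent
    while current is not None:
        titles.append(current.text)
        current = current.parent
    return " > ".join(reversed(titles))


def leaf_chunks(node: Node) -> Iterator[str]:
    """Yield one chunk per leaf: the parent chain on one line, then the leaf text."""
    if not node.children:
        chain = parent_chain(node)
        yield f"{chain}\n{node.text}" if chain else node.text
        return
    for child in node.children:
        yield from leaf_chunks(child)


# Toy tree mirroring the nesting the reader produced for the BART paper.
title = Node("BART...")
title.add(Node("Mike Lewis..."))
emails = title.add(Node("{mikelewis ...}..."))
abstract = emails.add(Node("Abstract"))
abstract.add(Node("We present BART, a denoising autoencoder for pretraining sequence-to-sequence models."))

chunks = list(leaf_chunks(title))
for chunk in chunks:
    print(chunk, end="\n\n")
```

Each yielded string is self-describing, so it can be embedded and stored on its own without losing the section it came from.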