<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/try_llamaparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Try LlamaParse on Multimodal PDF


* Requires `llama_index >= 0.10.4`

#### References
* [LlamaParse demo_advanced notebook][1]
* The PDF file was part of [this public report][2]


#### Lessons Learned
* Don't keep the PDF file in a GitHub repo; LlamaParse couldn't parse any text from a PDF stored that way.
* LlamaIndex's `SimpleDirectoryReader` couldn't load the PDF file saved here.
* LlamaParse extracted almost nothing from the tables and images in this example PDF, so the later Q&A could neither retrieve the right information nor answer the questions.


[1]:https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
[2]:https://ir.pdf.com/static-files/094e56d1-e4b8-4b50-a8b5-3eb723434de6

In [None]:
!pip install llama-index
!pip install llama-index-core==0.10.6.post1
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse
!pip install unstructured[local-inference]
!pip install pymupdf

In [None]:
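# Note: a Drive "/view" sharing link serves the viewer's HTML page, so the saved file may not be a real PDF.
!wget "https://drive.google.com/file/d/1hhGvbqE4vce9fgQ8q_VdTFrDYJYCG7Bc/view?usp=sharing" -O public_proxy_statement.pdf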
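
The last page node printed further down contains Google Drive viewer markup (`_DRIVE_VIEWER_...`, `</html>`), which suggests the download step may have saved a web page rather than the PDF. A minimal sketch of a sanity check, assuming the file name from the cell above; `is_pdf_header` and `looks_like_pdf` are hypothetical helper names, not part of LlamaParse:

```python
from pathlib import Path


def is_pdf_header(head):
    """Return True if the given leading bytes start with the PDF signature."""
    return head.startswith(b"%PDF-")


def looks_like_pdf(path):
    """Read the first bytes of `path` and check them for the PDF signature."""
    with open(path, "rb") as f:
        return is_pdf_header(f.read(5))


# Assumed file name, taken from the wget cell above.
pdf_path = Path("public_proxy_statement.pdf")
if pdf_path.exists():
    print("Looks like a PDF:", looks_like_pdf(pdf_path))
```

If this prints `False`, the file is probably an HTML page, and a direct-download link (not the `/view` sharing link) would be needed before parsing.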

## Set Up OpenAI and LlamaParse APIs

In [3]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()
import os


# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # fill in your own Llama Cloud API key
# Using the OpenAI API for embeddings and LLMs
os.environ["OPENAI_API_KEY"] = "sk-..."  # fill in your own OpenAI API key
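
Since the cell above ships with placeholder keys, it can help to fail fast when a key is missing or still a placeholder. A minimal sketch; `require_api_key` is a hypothetical helper, not a LlamaIndex or LlamaParse API:

```python
import os


def require_api_key(name):
    """Return the value of environment variable `name`, or raise if unusable.

    Hypothetical helper for this notebook, not a LlamaIndex/LlamaParse API.
    """
    value = os.environ.get(name, "")
    # Treat an empty value or the "..." placeholder from the cell above as missing.
    if not value or value.endswith("..."):
        raise RuntimeError(f"Set {name} to a real API key before running the notebook.")
    return value
```

Calling `require_api_key("LLAMA_CLOUD_API_KEY")` before parsing would surface a forgotten key immediately instead of as a failed API request later.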

In [4]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings


embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4-turbo")

Settings.llm = llm
Settings.embed_model = embed_model

## Using `LlamaParse` PDF reader for PDF Parsing


In [23]:
from llama_parse import LlamaParse


documents = LlamaParse(result_type="markdown").load_data("./public_proxy_statement.pdf")

Started parsing the file under job_id d8e7ddd2-ead3-46e5-b576-220cf888bd0a


In [7]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page node, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes
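
The page split above boils down to a plain string split on the page separator. A small sketch of just that step on ordinary strings, assuming the same `"\n---\n"` separator; `split_pages` and its `drop_empty` option are hypothetical additions for illustration:

```python
def split_pages(markdown_text, separator="\n---\n", drop_empty=False):
    """Split parser output into per-page chunks on `separator`."""
    chunks = markdown_text.split(separator)
    if drop_empty:
        # Skip chunks that contain only whitespace, e.g. from blank pages.
        chunks = [chunk for chunk in chunks if chunk.strip()]
    return chunks


print(split_pages("page one\n---\npage two"))
```

One caveat of this approach: a markdown horizontal rule inside a page uses the same `---` text, so it would also be treated as a page break.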

In [8]:
page_nodes = get_page_nodes(documents)

print(len(page_nodes))
print()
print(page_nodes[0])
print()
print(page_nodes[-1])

25

Node ID: fc6d5020-5c3c-4ab1-8fbb-c38b19ef0e4c
Text: # Simplified Proxy Statement  This document serves as a
simplified proxy statement for the upcoming shareholder meeting.  #
Table of Contents  1. Introduction 2. Voting Information 3. Proposals
4. Financial Information 5. Contact Information  # 1. Introduction
Dear Shareholders,  We are pleased to invite you to our annual meeting
of shareholder...

Node ID: 6123ac71-08ce-428b-9fdb-49956a51b329
Text: m=v,wb" nonce="tgukbe5zRyW7gAbPJ9cgNg"></script><script
nonce="tgukbe5zRyW7gAbPJ9cgNg">window['_DRIVE_VIEWER_stiming'] =
[221,221,221,259,null,259,1731327812988];</script></body></html>


In [9]:
print(page_nodes[0].get_content())

# Simplified Proxy Statement

This document serves as a simplified proxy statement for the upcoming shareholder meeting.

# Table of Contents

1. Introduction
2. Voting Information
3. Proposals
4. Financial Information
5. Contact Information

# 1. Introduction

Dear Shareholders,

We are pleased to invite you to our annual meeting of shareholders. Your vote is important to us.

# 2. Voting Information

Shareholders are encouraged to vote by proxy. The deadline for submitting your proxy is April 15, 2023.

# 3. Proposals

|Proposal|Description|Vote Required|
|---|---|---|
|1|Election of Directors|Majority|
|2|Approval of Executive Compensation|Majority|
|3|Ratification of Auditors|Majority|

# 4. Financial Information

For detailed financial information, please refer to our annual report available on our website.

# 5. Contact Information

If you have any questions, please contact our investor relations department at investor@company.com.

# Footnotes

1. The proposals are subject to ch

In [10]:
print(page_nodes[-1].get_content())

m=v,wb" nonce="tgukbe5zRyW7gAbPJ9cgNg"></script><script
nonce="tgukbe5zRyW7gAbPJ9cgNg">window['_DRIVE_VIEWER_stiming'] =
[221,221,221,259,null,259,1731327812988];</script></body></html>


In [11]:
from llama_index.core.node_parser import MarkdownElementNodeParser


# Splits a markdown document into Text Nodes and Index Nodes corresponding to embedded objects (e.g. tables)
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4-turbo"), num_workers=8
)

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [13]:
print("********************** sample nodes **********************")
print(len(nodes))
print()
print(nodes[0])
print()
print(nodes[-1])
print()

print("********************** objects' content **********************")
print(len(objects))
print(objects[0].get_content())
print()
print(objects[-1].get_content())

********************** sample nodes **********************
58

Node ID: b040c23e-5dc0-4966-a439-b19bf61d3c36
Text: Simplified Proxy Statement  This document serves as a simplified
proxy statement for the upcoming shareholder meeting.   Table of
Contents  1. Introduction 2. Voting Information 3. Proposals 4.
Financial Information 5. Contact Information   1. Introduction  Dear
Shareholders,  We are pleased to invite you to our annual meeting of
shareholders. Y...

Node ID: 4dfea4e6-a2ae-46c1-9359-871cd7690804
Text: m=v,wb" nonce="tgukbe5zRyW7gAbPJ9cgNg"></script><script
nonce="tgukbe5zRyW7gAbPJ9cgNg">window['_DRIVE_VIEWER_stiming'] =
[221,221,221,259,null,259,1731327812988];</script></body></html>

********************** objects' content **********************
1
The table lists proposals for a corporate meeting, detailing the type of proposal, its description, and the voting requirement for each.,
with the following columns:
- Proposal: None
- Description: None
- Vote Required: None


Th
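
Only one table object was found. To cross-check how many tables the parsed markdown actually contains, one rough option is to count table header-separator rows such as `|---|---|`. This is a heuristic sketch, not how `MarkdownElementNodeParser` detects tables; `count_markdown_tables` is a hypothetical helper:

```python
import re

# Matches a markdown table separator row such as |---|---| or | :--- | ---: |
TABLE_SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def count_markdown_tables(markdown_text):
    """Count tables by counting separator rows that directly follow a header row."""
    lines = markdown_text.splitlines()
    count = 0
    for previous, current in zip(lines, lines[1:]):
        # Require pipes on both lines so a bare '---' page break is not counted.
        if "|" in previous and "|" in current and TABLE_SEPARATOR.match(current):
            count += 1
    return count
```

For example, `sum(count_markdown_tables(doc.text) for doc in documents)` would give a total to compare against `len(objects)`.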

#### Vector Index over Both Strategies

In [14]:
# dump both indexed tables and page text into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)
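
Some of the nodes printed earlier are web-page markup or long ID lists instead of report text, and they go into the index as-is. A heuristic sketch for filtering such chunks before indexing; the marker list and the 0.5 letter-ratio threshold are assumptions chosen for illustration, and a number-heavy financial table could be flagged by mistake:

```python
HTML_MARKERS = ("<script", "</script>", "</html>", "<!doctype")


def alpha_ratio(text):
    """Fraction of characters in `text` that are letters (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def looks_like_web_residue(text, min_alpha_ratio=0.5):
    """Flag text that contains HTML markers or is mostly non-letter characters."""
    lowered = text.lower()
    if any(marker in lowered for marker in HTML_MARKERS):
        return True
    return alpha_ratio(text) < min_alpha_ratio


def drop_web_residue(texts, min_alpha_ratio=0.5):
    """Keep only the texts that do not look like web-page residue."""
    return [t for t in texts if not looks_like_web_residue(t, min_alpha_ratio)]
```

Applied to nodes, this could look like `[n for n in page_nodes if not looks_like_web_residue(n.get_content())]`.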

In [15]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
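
With `similarity_top_k=5` and `top_n=5`, the reranker receives five candidates and returns five, so it can only reorder them. A toy sketch of the general retrieve-then-rerank pattern showing the difference; this is not LlamaIndex's implementation, and the function name and scores are made up for illustration:

```python
def retrieve_then_rerank(candidates, first_stage_score, rerank_score,
                         similarity_top_k, top_n):
    """Shortlist by a cheap score, then keep the best `top_n` by a second score."""
    shortlisted = sorted(candidates, key=first_stage_score, reverse=True)[:similarity_top_k]
    return sorted(shortlisted, key=rerank_score, reverse=True)[:top_n]


# Made-up scores: embedding similarity first, reranker relevance second.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
relevance = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}
docs = list(similarity)

# Shortlist larger than top_n: the reranker can drop the weak candidate "a".
print(retrieve_then_rerank(docs, similarity.get, relevance.get, 3, 2))
# Shortlist equal to top_n: the same two candidates come back, only reordered.
print(retrieve_then_rerank(docs, similarity.get, relevance.get, 2, 2))
```

Setting `similarity_top_k` larger than the reranker's `top_n` gives the reranker room to filter, at the cost of more reranking work per query.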



## Using `LlamaParse` Output to Answer Questions about the Parsed PDF

In [27]:
query1 = "What's the Amount and Nature of Beneficial Ownership for BlackRock?"

response1 = recursive_query_engine.query(query1)
print(response1)

[1;3;38;2;11;159;203mRetrieval entering 0e43f26c-4520-4fbc-879a-e1f9448939a5: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query What's the Amount and Nature of Beneficial Ownership for BlackRock?
[0m



The provided information does not include details about the amount and nature of beneficial ownership for BlackRock. For specific shareholder information, you may want to refer to the company's detailed financial disclosures or contact their investor relations department.


In [28]:
print(response1.source_nodes)

[NodeWithScore(node=TextNode(id_='7d317a04-36e2-48bf-99db-3a3ab660e439', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='2ba0f745-81cc-46dc-99fc-2bab9bfa76cc', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='ffded051f466d137f41ded44364cea47ba8eb635eee1e682affa21dc9782609e'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='8899ce29-b383-4ccc-85f2-147098250dce', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='a346339c6f533018ed73c79679e3c84069036fde4c3caca8edeafe1d9ce625a0')}, text='71699749,95271251,99437421,49833482,50498807,95254\n960,95104239,71530183,50273568,94597599,50439140,95087066,101507086,50529263,715\n45613,95234911,94724970,71897987,94413667,71185130,71387252,94434457,48966254,94\n397761,71478140,71659913,95317965,50297096,5737784,71722078,71560089,49623133,94\n904109,49375334,101543289,50266062,71574030,71849695,99311019,71924331,948

In [29]:
query2 = "What are the Equity Compensation Plans Approved by Stockholders"

response2 = recursive_query_engine.query(query2)
print(response2)
print()
print(response2.source_nodes)

Retrieval entering 0e43f26c-4520-4fbc-879a-e1f9448939a5: TextNode
Retrieving from object TextNode with query What are the Equity Compensation Plans Approved by Stockholders



The document does not specifically mention "Equity Compensation Plans Approved by Stockholders." However, it includes a proposal for the "Approval of Executive Compensation," which typically encompasses various forms of compensation for executives, potentially including equity compensation plans. This proposal requires a majority vote to pass.

[NodeWithScore(node=TextNode(id_='6ad52a9a-c4c0-42e0-80ae-bbd4dc2a22d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9d817132-21fb-4083-9ee1-51a1d71a8219', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='14868d464e32feacff773a4e07228257dbffd2934a3ce3f1b578751091aa95f8'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='b94ec672-0638-4ef0-abd7-55ab44ce6de7', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='0af965fa8208c4f5e8b1031bf740319475ac32625b7798700181d9f7e2053193')}, text='0,"docs-sptm":true,"docs-

In [30]:
query3 = "What's the relationship Between 'Compensation Actually Paid' and Performance Measures"

response3 = recursive_query_engine.query(query3)
print(response3)
print()
print(response3.source_nodes)



The relationship between "Compensation Actually Paid" and performance measures typically involves aligning the compensation of executives and other key employees with the performance of the company. This is done to incentivize the achievement of key performance indicators and company goals. Performance measures might include financial metrics such as revenue growth, profitability, return on investment, or other specific operational targets. Compensation structures often include bonuses, stock options, and other performance-based incentives to ensure that the interests of the executives align with the interests of the shareholders and the overall success of the company.

[NodeWithScore(node=TextNode(id_='8899ce29-b383-4ccc-85f2-147098250dce', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='2ba0f745-81cc-46dc-99fc-2bab9bfa76cc', node_type=<ObjectType.DOCUMENT: '4'>, metada

In [33]:
print(len(response3.source_nodes))

for n in response3.source_nodes:
    print(n.get_content())
    print()

5
"docs-mm":10,"docos-
dphl":10000,"docos-dpsl":9900,"docs-cpr":true,"docos-edutfr":false,"docs-
eveia":false,"uls":"","customer_type":"ND","docs-obsImUrl":"https://
ssl.gstatic.com/docs/common/netcheck.gif","docs-ecuach":false,"docs-
ecci":false,"docs-esi":false,"docs-cei":{"i":
[99437429,49833490,50498815,95254968,95104247,71530191,50273576,94597607,5043914
8,95087074,101507094,50529271,5705891,71545621,95234919,94724986,71897995,944136
75,5703839,71185138,71387260,94434465,5712647,48966262,94397769,71478148,7165992
1,101705085,50297104,5737800,71722094,71560097,49623141,94904117,49375342,101543
305,50266070,71574038,71849703,99311027,71924339,94813451,71079826,5704695,49823
132,94420765,50513182,71532737,49622751,101771798,49924734,101426124,71882134,10
1441512,71721015,95136081,5704745,94661850,49643963,94707472,101708551,71690008,
71679568,101544635,99247624,94526616,101809790,101440077,71502929,94573608,95314
640,101406802,71402445,94929278,95271053,49769385,49372463,101456380,95

In [34]:
query4 = "List 'Name and Address of Beneficial Owner'"

response4 = recursive_query_engine.query(query4)
print(response4)
print()
print(response4.source_nodes)



The document does not provide the names and addresses of beneficial owners.

[NodeWithScore(node=TextNode(id_='85042166-950b-4095-a7db-ab91f8402fdf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a862bdf9-c188-42e1-8ccb-7c9e38b36808', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='dfd3595b82ed63da86c59cecda74230a481d797a31dda090fd2e525dac932cc3'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='4a84b703-1b8c-4d38-be6c-2b6719833c21', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='4efa7ae398d594cfa9cea0692398189d8786318707f14b871ceffc85a97b630f')}, text='95234919,95250121,95250137,95251142,95251150,9525496\n0,95254968,95266620,95266628,95271045,95271053,95271243,95271251,95288980,952889\n96,95314632,95314640,95317965,99237561,99237569,99247616,99247624,99311019,99311\n027,99327843,99327859,99338480,99338488,99368692,99368700,99375188,99375196,9