<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/try_llamaparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Try LlamaParse on Multimodal PDF


* Requires `llama_index >= 0.10.4`

#### References
* [LlamaParse demo_advanced notebook][1]
* [The PDF file][2]



[1]:https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
[2]:https://github.com/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/dataset/page78.pdf

In [None]:
!pip install llama-index
!pip install llama-index-core==0.10.6.post1
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse
!pip install unstructured[local-inference]
!pip install httpx==0.27.2

## Set Up OpenAI and LlamaParse APIs

In [None]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()
import os


os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # FILL YOUR OWN LLAMA CLOUD API KEY
os.environ["OPENAI_API_KEY"] = "sk-..."  # FILL YOUR OWN OPENAI API KEY
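Hard-coding keys in a cell makes them easy to leak when the notebook is shared. The following is a minimal sketch of an alternative, assuming the keys are either already exported in the environment or can be typed in interactively. `ensure_api_key` is a hypothetical helper written for this notebook; it is not part of llama-parse or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_api_key(name, prompt=getpass):
    """Return the key stored in os.environ[name], prompting for it if missing.

    Hypothetical helper for this notebook only.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        # Ask interactively so the secret never appears in the notebook source.
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} is required but was not provided")
        os.environ[name] = key
    return key


# Usage in the notebook (uncomment to run interactively):
# ensure_api_key("LLAMA_CLOUD_API_KEY")
# ensure_api_key("OPENAI_API_KEY")
```

The `prompt` parameter exists only so the helper can be exercised without a terminal; in normal use the default `getpass` hides the typed value.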

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

# use OpenAI Embedding model and LLM model
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4-turbo")

Settings.llm = llm
Settings.embed_model = embed_model

## Using `LlamaParse` PDF reader for PDF Parsing


In [None]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("./page78.pdf")
documents

Started parsing the file under job_id e84568d6-0d63-43ff-86c8-8f971bdecbcd


[Document(id_='8c44b3bd-7e22-4b40-8c01-8fba91dc7b0f', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='# Document 1\n\nhis works are considered classics of American literature. His wartime experience formed the basis for his novel "A Farewell to Arms" (1929).\n\ncommunity His debut novel, "The Sun Also Rises" artists of the 1920s "Lost Generation" expatriate, was published in 1926.\n\n# Figure 2\n\nRAG-Token document posterior p(za/z, yi, y-i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises."\n\n# Table 3: Examples from generation tasks.\n\n|Task|Input|Model|Generation|\n|---|---|---|---|\n|define middle ear| |BART|\'The middle ear is the part of the ear between the middle ear and the

In [None]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex

def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes, one node per separator-delimited chunk."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes
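To see what the split does without calling the parsing service, the same idea can be tried on a plain string. This is a sketch under stated assumptions: `split_pages` is a hypothetical stand-in that returns dicts instead of `TextNode` objects, and it also records a 1-based page number, which the function above does not.

```python
def split_pages(text, separator="\n---\n"):
    """Split markdown text on a page separator and number the pages from 1.

    Hypothetical stand-in for the TextNode-based function above; it returns
    plain dicts so it can run without llama_index installed.
    """
    pages = []
    for page_number, chunk in enumerate(text.split(separator), start=1):
        pages.append({"page": page_number, "text": chunk.strip()})
    return pages


sample = "# Page one\nfirst page text\n---\n# Page two\nsecond page text"
for page in split_pages(sample):
    print(page["page"], repr(page["text"]))
```

If the page number is useful at query time, one option would be to store it in each node's `metadata` dict next to the copied document metadata.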

In [None]:
page_nodes = get_page_nodes(documents)

print(len(page_nodes))
print()
print(page_nodes[0])
print()
print(page_nodes[-1])

2

Node ID: f279fec7-087c-4427-a6ce-e5adbe09371f
Text: # Document 1  his works are considered classics of American
literature. His wartime experience formed the basis for his novel "A
Farewell to Arms" (1929).  community His debut novel, "The Sun Also
Rises" artists of the 1920s "Lost Generation" expatriate, was
published in 1926.  # Figure 2  RAG-Token document posterior p(za/z,
yi, y-i) for each g...

Node ID: 88fb3432-51ca-4fc5-875f-dbe196d8fc62
Text: # Table 4: Human assessments for the Jeopardy generation tasks
|Question Generation Task|Factuality|Specificity| |---|---|---| |BART
better|7.1%|16.89| |RAG better|42.7%|37.4%| |Both good|11.79|11.89|
|Both poor| | | |No majority|17.79|6.9%|  # Table S: Ratio of distinct
to total tri-grams for MSMARCO and Jeopardy QGen  | |Gold|BART|RAG-
Token|R...


In [None]:
print(page_nodes[0].get_content())

# Document 1

his works are considered classics of American literature. His wartime experience formed the basis for his novel "A Farewell to Arms" (1929).

community His debut novel, "The Sun Also Rises" artists of the 1920s "Lost Generation" expatriate, was published in 1926.

# Figure 2

RAG-Token document posterior p(za/z, yi, y-i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises."

# Table 3: Examples from generation tasks.

|Task|Input|Model|Generation|
|---|---|---|---|
|define middle ear| |BART|'The middle ear is the part of the ear between the middle ear and the nose.'|
| | |RAG-T|The middle ear includes the tympanic cavity and the three ossicles. The middle ear is the portion of the ear internal to the eardrum.|
| | |RAG-S| |
|what currency needed in Scotland| |BART|The currency needed in Scotland is Po

In [None]:
print(page_nodes[-1].get_content())

# Table 4: Human assessments for the Jeopardy generation tasks

|Question Generation Task|Factuality|Specificity|
|---|---|---|
|BART better|7.1%|16.89|
|RAG better|42.7%|37.4%|
|Both good|11.79|11.89|
|Both poor| | |
|No majority|17.79|6.9%|

# Table S: Ratio of distinct to total tri-grams for MSMARCO and Jeopardy QGen

| |Gold|BART|RAG-Token|RAG-Sequence|
|---|---|---|---|---|
| |89.69|70.7%|77.8%|83.59|
| |90.0%|32.49|46.8%|53.89|

# Table 6: Ablations On the dev set As FEVER is 4 classification task; both RAG models are equivalent

|Model|NQ|TQA|WQ|CT|Jeopardy-QGen|MSMarco|FVR-3|FVR-2|
|---|---|---|---|---|---|---|---|---|
|RAG-Token-BM2S|29.7|41.5|32.1|33.1|17.5|22.3|55.5|48.4|
|RAG-Sequence-BM2S|31.8|44.1|36.6|33.8|11.1|19.5|56.5|46.9|
|RAG-Token-Frozen|37.8|50.1|37.1|51.1|16.7|21.7|55.9|49.4|
|RAG-Sequence-Frozen|41.2|52.1|41.8|52.6|11.8|19.6|56.7|47.3|
|RAG-Token|43.5|54.8|46.5|51.9|17.9|22.6|56.2|49.4|
|RAG-Sequence|44.0|55.8|44.9|53.4|15.3|21.5|57.2|47.5|

# Effect of Retriev
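In the parsed Table 4 above, values such as `16.89` and `11.79` sit in columns where other cells end in `%`. That pattern suggests, though the output alone cannot confirm it, that a trailing `%` may have been misread as `9`. A rough sketch for flagging such cells before indexing follows; `parse_markdown_table` and `flag_inconsistent_percent_cells` are hypothetical helpers, and the heuristic is an assumption rather than anything LlamaParse provides.

```python
def _cells(line):
    """Split one pipe-delimited markdown row into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(table_text):
    """Parse a markdown table into (header, rows); line 2 is the divider."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    return _cells(lines[0]), [_cells(ln) for ln in lines[2:]]


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def flag_inconsistent_percent_cells(table_text):
    """Return (row label, column name, value) for numeric cells without '%'
    in any column where at least one other cell ends in '%'."""
    header, rows = parse_markdown_table(table_text)
    flagged = []
    for col, name in enumerate(header):
        values = [row[col] for row in rows if col < len(row)]
        if not any(value.endswith("%") for value in values):
            continue  # not a percentage column, nothing to compare against
        for row in rows:
            if col >= len(row):
                continue
            value = row[col]
            if value and not value.endswith("%") and _is_number(value):
                flagged.append((row[0], name, value))
    return flagged


table_4 = """
|Question Generation Task|Factuality|Specificity|
|---|---|---|
|BART better|7.1%|16.89|
|RAG better|42.7%|37.4%|
|Both good|11.79|11.89|
|Both poor| | |
|No majority|17.79|6.9%|
"""

for row_label, column, value in flag_inconsistent_percent_cells(table_4):
    print(f"{row_label} / {column}: {value}")
```

Flagged cells are candidates for manual review against the PDF, not automatic corrections.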

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

# Splits a markdown document into Text Nodes and Index Nodes corresponding to embedded objects (e.g. tables)
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4-turbo"), num_workers=8
)
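The idea of separating tables from surrounding text can be illustrated with a few lines of plain Python. This is only a rough sketch of the concept, assuming a line is a table row when it starts with `|`; `split_markdown_elements` is a hypothetical function and does not reflect how `MarkdownElementNodeParser` is implemented (the real parser also uses the LLM, for example, to summarize tables).

```python
def split_markdown_elements(markdown):
    """Group markdown lines into ("text", ...) and ("table", ...) elements.

    Illustration only: a line counts as a table row when it starts with "|"
    after stripping whitespace.
    """
    elements = []
    current_kind, current_lines = None, []
    for line in markdown.splitlines():
        if not line.strip():
            continue  # blank lines carry no content; skip them
        kind = "table" if line.strip().startswith("|") else "text"
        if kind != current_kind and current_lines:
            # The element type changed, so close the element collected so far.
            elements.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        elements.append((current_kind, "\n".join(current_lines)))
    return elements


sample = """# Table 4: Human assessments

|Task|Factuality|
|---|---|
|RAG better|42.7%|

# Effect of retrieving more documents
Some prose."""

for kind, content in split_markdown_elements(sample):
    print(kind, "->", content.splitlines()[0])
```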

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

1it [00:00, 547.13it/s]
3it [00:00, 2403.15it/s]


In [None]:
print("********************** sample nodes **********************")
print(len(base_nodes))
print()
print(base_nodes[0].get_content())
print("********************************************")
print(base_nodes[-1].get_content())
print()

print("********************** objects' content **********************")
print(len(objects))
print(objects[0].get_content())
print()
print(objects[-1].get_content())

********************** sample nodes **********************
6

Document 1

his works are considered classics of American literature. His wartime experience formed the basis for his novel "A Farewell to Arms" (1929).

community His debut novel, "The Sun Also Rises" artists of the 1920s "Lost Generation" expatriate, was published in 1926.

 Figure 2

RAG-Token document posterior p(za/z, yi, y-i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises."

 Table 3: Examples from generation tasks.
********************************************
Effect of Retrieving more documents

Models are trained with either 5 or 10 retrieved latent documents, and we do not observe significant differences in performance between them. We have the flexibility to adjust the number of retrieved documents at test time which can affect performance

#### Vector Indexing Parsed Content

In [None]:
# dump parsed contents into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Rerank the retrieved nodes and keep the top_n, pruning less relevant nodes from the context.
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
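The query engine above follows a retrieve-then-rerank pattern: a fast vector search selects candidates, then a slower scorer reorders them and keeps the best few. The sketch below shows that pattern with toy data. Everything in it is an assumption made for illustration: `retrieve_then_rerank` is hypothetical, the two-dimensional vectors are made up, and `word_overlap` is a crude stand-in for a real reranking model such as `BAAI/bge-reranker-large`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def word_overlap(query, text):
    """Toy reranking score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_then_rerank(query_vec, query_text, corpus, rerank_score,
                         top_k=5, top_n=5):
    """corpus is a list of (text, vector) pairs; returns the top_n texts."""
    # Stage 1: cheap vector similarity selects top_k candidates.
    candidates = sorted(
        corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )[:top_k]
    # Stage 2: the reranker reorders the candidates; only top_n survive.
    reranked = sorted(
        candidates, key=lambda item: rerank_score(query_text, item[0]),
        reverse=True,
    )
    return [text for text, _ in reranked[:top_n]]


corpus = [
    ("Table 6 reports ablations on the dev set", [0.9, 0.1]),
    ("Figure 2 shows the document posterior", [0.8, 0.3]),
    ("Hemingway wrote A Farewell to Arms", [0.1, 0.9]),
]

print(retrieve_then_rerank(
    [1.0, 0.2], "what does figure 2 show", corpus, word_overlap,
    top_k=2, top_n=1,
))
```

Note that when `top_n` equals the number of retrieved candidates, as with `similarity_top_k=5` and `top_n=5` (assuming the retriever hands the reranker exactly five nodes), the second stage only reorders. Retrieving more candidates than `top_n` gives the reranker something to discard.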



## `LlamaParse` to Answer Questions Related to Parsed PDF

In [None]:
query1 = "What's figure 2 about?"

response1 = recursive_query_engine.query(query1)
print(response1)
print()
print(len(response1.source_nodes))
for source_node in response1.source_nodes:
    print(source_node)

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Retrieval entering b43a98f2-02ed-4327-adcb-3a9e837bd5e4: TextNode
Retrieving from object TextNode with query What's figure 2 about?
Figure 2 discusses the RAG-Token document posterior probabilities for each generated token when the input is "Hemingway" for Jeopardy question generation. It shows that the posterior probability is high for the document related to "A Farewell to Arms" and for another document when generating "The Sun Also Rises."

5
Node ID: f279fec7-087c-4427-a6ce-e5adbe09371f
Text: # Document 1  his works are considered classics of American
literature. His wartime experience formed the basis for his novel "A
Farewell to Arms" (1929).  community His debut novel, "The Sun Also
Rises" artists of the 1920s "Lost Generation" expatriate, was
published in 1926.  # Figure 2  RAG-Token document posterior p(za/z,
yi, y-i) for each g...
Score: -2.777

Node ID: b43a98f2-02ed-4327-adcb-3a9e837bd5e4
Text: The table compares the perfo
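Printing each source node in full gets long quickly. A compact, score-ordered summary can be easier to scan. The sketch below assumes each source node exposes `node_id`, `score`, and `get_content()`; `FakeSourceNode` is a stand-in defined here so the example runs without llama_index, and `summarize_sources` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSourceNode:
    """Stand-in exposing the attributes this sketch assumes on a source node."""
    node_id: str
    score: Optional[float]
    text: str

    def get_content(self):
        return self.text


def summarize_sources(source_nodes, width=60):
    """Return one 'score | node id prefix | snippet' line per node, best first."""
    ordered = sorted(
        source_nodes,
        key=lambda n: n.score if n.score is not None else float("-inf"),
        reverse=True,
    )
    lines = []
    for node in ordered:
        # Collapse whitespace so multi-line content fits on one line.
        snippet = " ".join(node.get_content().split())[:width]
        score = "n/a" if node.score is None else f"{node.score:.3f}"
        lines.append(f"{score} | {node.node_id[:8]} | {snippet}")
    return lines


# Made-up nodes for illustration; in the notebook, pass response1.source_nodes.
demo_nodes = [
    FakeSourceNode("aaaaaaaa-0000", -3.5, "a lower scored chunk"),
    FakeSourceNode("bbbbbbbb-0000", -1.25, "a higher\nscored   chunk"),
]
for line in summarize_sources(demo_nodes):
    print(line)
```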

In [None]:
query2 = "Which RAG model is the best in this paper?"

response2 = recursive_query_engine.query(query2)
print(response2)
print()
print(len(response2.source_nodes))
for source_node in response2.source_nodes:
    print(source_node)

Retrieval entering f7884dd3-72f7-4246-9d11-8da416b57079: TextNode
Retrieving from object TextNode with query Which RAG model is the best in this paper?
The RAG-Sequence model generally shows better performance across various tasks compared to the RAG-Token model, as indicated by the results in Table 6. For instance, in the FEVER-3 classification task, the RAG-Sequence model scores 57.2, which is higher than the RAG-Token's 56.2. Similarly, in the TQA and WQ tasks, RAG-Sequence scores 55.8 and 44.9 respectively, both of which are higher than the scores of RAG-Token. Additionally, the RAG-Sequence model also demonstrates a higher percentage of distinct to total tri-grams in the Jeopardy Question Generation task compared to RAG-Token, as shown in Table S. Therefore, based on these results, the RAG-Sequence model is considered the best performing model in this paper.

5
Node ID: 410525ce-d72d-4ac8-9878-4c501936eed4
Text: Additional Result

In [None]:
query3 = "Why RAG-Sequence is better than RAG-Token?"

response3 = recursive_query_engine.query(query3)
print(response3)
print()
print(len(response3.source_nodes))
for source_node in response3.source_nodes:
    print(source_node)

Retrieval entering f7884dd3-72f7-4246-9d11-8da416b57079: TextNode
Retrieving from object TextNode with query Why RAG-Sequence is better than RAG-Token?
RAG-Sequence tends to perform better than RAG-Token in several aspects. Firstly, it shows higher performance in open-domain question answering when more documents are retrieved at test time, as indicated by the monotonically improving results. Additionally, RAG-Sequence generally maintains a better balance between Rouge-L and Bleu-1 scores, suggesting it manages the trade-off between these metrics more effectively than RAG-Token. Furthermore, in the human assessments for the Jeopardy generation tasks, RAG-Sequence is often more factual and specific compared to RAG-Token, which is crucial for generating accurate and detailed responses. Lastly, the overall scores across various NLP tasks like NQ, TQA, WQ, and CT also tend to be higher for RAG-Sequence, indicating a more robust performanc