<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/try_llamaparse.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Try LlamaParse on Multimodal PDF

#### Notes
* `llama_index >=0.10.4`
* More about chunking LlamaParse output: see the LlamaIndex [node parser][3] reference

#### Observations
* The free version of LlamaParse has a daily parsing limit.
* LlamaParse Premium is enabled with `premium_mode=True`. It parses tables better, especially nested tables.
* LlamaParse could not parse the charts in this PDF.
* Better parsed content does not guarantee better Q&A, because the answers also depend on the LLM (OpenAI here).

#### References
* [LlamaParse demo_advanced notebook][1]
* [The PDF file][2]



[1]:https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
[2]:https://github.com/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/dataset/page78.pdf
[3]:https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/

In [None]:
!pip install llama-index
!pip install llama-index-core==0.10.6.post1
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse
!pip install unstructured[local-inference]
!pip install httpx==0.27.2

## Setup OpenAI and LlamaParse APIs

In [None]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()
import os


os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # FILL YOUR OWN LLAMA CLOUD API KEY
os.environ["OPENAI_API_KEY"] = "sk-..."  # FILL YOUR OWN OPENAI API KEY
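Hardcoded keys are easy to leak when a notebook is shared. A minimal sketch of an alternative, assuming the keys may already be set in the environment and falling back to an interactive prompt otherwise. `ensure_api_key` is a hypothetical helper written for this notebook, not part of llama-parse or the OpenAI SDK:

```python
import os
from getpass import getpass


def ensure_api_key(name, prompt=getpass):
    """Return the key stored in os.environ[name], prompting for it if missing.

    Hypothetical helper for illustration. `prompt` is injectable so the
    function can be exercised without an interactive terminal.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required but was not provided")
        # Store the key so libraries that read the environment can find it.
        os.environ[name] = value
    return value
```

Calling `ensure_api_key("LLAMA_CLOUD_API_KEY")` and `ensure_api_key("OPENAI_API_KEY")` would then replace the two hardcoded assignments.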

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

# use OpenAI Embedding model and LLM model
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4-turbo")

Settings.llm = llm
Settings.embed_model = embed_model

## Using `LlamaParse` PDF reader for PDF Parsing


In [None]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown", premium_mode=True).load_data("./page78.pdf")
documents

Started parsing the file under job_id 45a15189-9c98-452d-81c5-608573b16824
..

[Document(id_='57910a7f-e777-49bb-9b68-b319da1a0352', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='\nDocument 1: his works are considered classics of American literature ... His wartime experiences formed the basis for his novel "A Farewell to Arms" (1929) ...\n\nDocument 2: ... artists of the 1920s "Lost Generation" expatriate community. His debut novel, "The Sun Also Rises", was published in 1926.\n\nFigure 2: RAG-Token document posterior p(zi|x, yi, y<i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises".\n\nTable 3: Examples from generation tasks. RAG models generate more specific and factually accurate responses. \'?\' indicates factually incorrect responses, * indicates part

#### Chunking by page separator
* Nodes are the chunks: here, each page of the LlamaParse output becomes one node

In [None]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex

def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes on the given separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes
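To see what the separator-based split produces without calling the LlamaParse API, the same logic can be tried on a plain string. This is a sketch that assumes pages in the parsed markdown are joined with `\n---\n`, the default used by `get_page_nodes`. `split_pages` and the sample text are made up for illustration. Unlike `get_page_nodes`, this sketch also drops empty chunks.

```python
def split_pages(text, separator="\n---\n"):
    """Split markdown text on the page separator, dropping empty pages."""
    pages = [chunk.strip() for chunk in text.split(separator)]
    return [page for page in pages if page]


# Made-up two-page sample, not actual LlamaParse output.
sample = (
    "# Page one\nSome text."
    "\n---\n"
    "## Table 4\n| a | b |\n|---|---|\n| 1 | 2 |"
    "\n---\n"
)
pages = split_pages(sample)
print(len(pages))
for page in pages:
    print(repr(page))
```

The table delimiter row `|---|---|` does not trigger a split because the separator must match a line containing only `---`. A markdown horizontal rule inside a page would, however, be treated as a page break.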

In [None]:
page_nodes = get_page_nodes(documents)

print(len(page_nodes))
print()
print(page_nodes[0])
print()
print(page_nodes[-1])

2

Node ID: 7b9a5a80-dbfb-4898-b4cf-48320fb02914
Text: Document 1: his works are considered classics of American
literature ... His wartime experiences formed the basis for his novel
"A Farewell to Arms" (1929) ...  Document 2: ... artists of the 1920s
"Lost Generation" expatriate community. His debut novel, "The Sun Also
Rises", was published in 1926.  Figure 2: RAG-Token document posterior
p(zi|x,...

Node ID: a6dd22ac-6384-4e5e-bace-73de7b3a30af
Text: ## Table 4: Human assessments for the Jeopardy Question
Generation Task.  | | Factuality | Specificity |
|-------------|------------|-------------| | BART better | 7.1% |
16.8% | | RAG better | 42.7% | 37.4% | | Both good | 11.7% | 11.8% | |
Both poor | 17.7% | 6.9% | | No majority | 20.8% | 20.1% |  ## Table
5: Ratio of distinct to total tri-gr...


In [None]:
print(page_nodes[0].get_content())


Document 1: his works are considered classics of American literature ... His wartime experiences formed the basis for his novel "A Farewell to Arms" (1929) ...

Document 2: ... artists of the 1920s "Lost Generation" expatriate community. His debut novel, "The Sun Also Rises", was published in 1926.

Figure 2: RAG-Token document posterior p(zi|x, yi, y<i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises".

Table 3: Examples from generation tasks. RAG models generate more specific and factually accurate responses. '?' indicates factually incorrect responses, * indicates partially correct responses.

| Task | Input | Model | Generation |
|------|-------|-------|------------|
| MS-MARCO | define middle ear | BART | The middle ear is the part of the ear between the middle ear and the nose. |
| | | RAG-T | The midd

In [None]:
print(page_nodes[-1].get_content())


## Table 4: Human assessments for the Jeopardy Question Generation Task.

| | Factuality | Specificity |
|-------------|------------|-------------|
| BART better | 7.1% | 16.8% |
| RAG better | 42.7% | 37.4% |
| Both good | 11.7% | 11.8% |
| Both poor | 17.7% | 6.9% |
| No majority | 20.8% | 20.1% |

## Table 5: Ratio of distinct to total tri-grams for generation tasks.

| | MSMARCO | Jeopardy QGen |
|-----------|-----------|---------------|
| Gold | 89.6% | 90.0% |
| BART | 70.7% | 32.4% |
| RAG-Token | 77.8% | 46.8% |
| RAG-Seq. | 83.5% | 53.8% |

## Table 6: Ablations on the dev set. As FEVER is a classification task, both RAG models are equivalent.

| Model | NQ | TQA | WQ | CT | Jeopardy-QGen | MSMarco | FVR-3 | FVR-2 |
|-------|----|----|----|----|----------------|---------|-------|-------|
| | | Exact Match | | | B-1 | QB-1 | R-L | B-1 | Label Accuracy |
| RAG-Token-BM25 | 29.7 | 41.5 | 32.1 | 33.1 | 17.5 | 22.3 | 55.5 | 48.4 | 75.1 | 91.6 |
| RAG-Sequence-BM25 | 31.8 | 44.1 | 

#### Chunking by Markdown Elements

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

# Splits a markdown document into Text Nodes and Index Nodes corresponding to embedded objects (e.g. tables)
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4-turbo"), num_workers=8
)
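The idea behind element-based chunking is to separate tables from the surrounding text so that each can be indexed on its own. The sketch below is a simplified stand-in for illustration only. It is not how `MarkdownElementNodeParser` is implemented, and it makes no LLM calls. `split_markdown_elements` is a hypothetical name, and the rule "a line that starts and ends with `|` is a table row" is an assumption of this sketch.

```python
def split_markdown_elements(markdown):
    """Group consecutive lines into ("text", body) and ("table", body) elements."""
    groups = []  # each entry is [kind, list_of_lines]
    for line in markdown.splitlines():
        stripped = line.strip()
        # Assumption: a table row starts and ends with a pipe character.
        is_row = len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|")
        kind = "table" if is_row else "text"
        if groups and groups[-1][0] == kind:
            groups[-1][1].append(line)
        else:
            groups.append([kind, [line]])
    joined = [(kind, "\n".join(lines).strip()) for kind, lines in groups]
    # Drop groups that contain only blank lines.
    return [(kind, body) for kind, body in joined if body]


# Small sample modelled on the parsed output shown earlier.
sample_md = """## Table 4: Human assessments

| | Factuality | Specificity |
|---|---|---|
| BART better | 7.1% | 16.8% |

Some closing text."""

for kind, body in split_markdown_elements(sample_md):
    print(kind, "->", repr(body[:40]))
```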

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

1it [00:00, 687.59it/s]
3it [00:00, 23519.46it/s]


In [None]:
# check whether any charts have been parsed
for chart in objects:
  if 'type' in chart.metadata and chart.metadata['type'] == 'table':
    print(chart.get_content())
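When the loop above prints nothing, it is unclear whether there are no objects at all or simply none tagged as `table`. A small sketch that tallies the `type` metadata instead, assuming only that each object exposes a `.metadata` dict as in the loop above. `count_object_types` is a hypothetical helper, and the stand-in objects are made up so the snippet runs without the parser.

```python
from collections import Counter
from types import SimpleNamespace


def count_object_types(objects):
    """Tally the 'type' metadata value of each object; missing values count as 'unknown'."""
    return Counter(obj.metadata.get("type", "unknown") for obj in objects)


# Stand-in objects for illustration; real ones come from get_nodes_and_objects().
stand_in_objects = [
    SimpleNamespace(metadata={"type": "table"}),
    SimpleNamespace(metadata={"type": "table"}),
    SimpleNamespace(metadata={}),
]
print(count_object_types(stand_in_objects))
```

In the notebook, `count_object_types(objects)` would show at a glance which element types the parser produced.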

In [None]:
print("********************** sample nodes **********************")
print(len(base_nodes))
print()
print(base_nodes[0].get_content())
print("********************************************")
print(base_nodes[-1].get_content())
print()

print("********************** objects' content **********************")
print(len(objects))
print(objects[0].get_content())
print()
print(objects[-1].get_content())

********************** sample nodes **********************
6

Document 1: his works are considered classics of American literature ... His wartime experiences formed the basis for his novel "A Farewell to Arms" (1929) ...

Document 2: ... artists of the 1920s "Lost Generation" expatriate community. His debut novel, "The Sun Also Rises", was published in 1926.

Figure 2: RAG-Token document posterior p(zi|x, yi, y<i) for each generated token for input "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises".

Table 3: Examples from generation tasks. RAG models generate more specific and factually accurate responses. '?' indicates factually incorrect responses, * indicates partially correct responses.
********************************************
between these dates and use a template "Who is [position]?" (e.g. "Who is the President of Peru?") to query ou

#### Vector Indexing Parsed Content

In [None]:
# dump parsed contents into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# to prune away irrelevant nodes from the context.
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
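The query engine works in two stages: the retriever returns `similarity_top_k` candidates, then the reranker rescores them and keeps `top_n`. With both set to 5, as here, the reranker can reorder the candidates but has little room to drop any. The sketch below illustrates the two stages with toy word-overlap scores. These scorers are assumptions standing in for the embedding similarity and the `BAAI/bge-reranker-large` cross-encoder, and every function name is hypothetical.

```python
def overlap_score(query, text):
    """Toy first-stage score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def density_score(query, text):
    """Toy second-stage score: share of the text's words that are query words."""
    query_words = set(query.lower().split())
    words = text.lower().split()
    return sum(word in query_words for word in words) / len(words) if words else 0.0


def retrieve_then_rerank(query, chunks, first_stage, second_stage, top_k, top_n):
    """Keep the top_k chunks by first_stage score, then the top_n of those by second_stage."""
    candidates = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)[:top_k]
    return sorted(candidates, key=lambda c: second_stage(query, c), reverse=True)[:top_n]


# Made-up chunks for illustration.
chunks = [
    "table 4 human assessments for jeopardy",
    "figure 2 shows the document posterior for each generated token",
    "figure 2",
    "ratio of distinct tri-grams",
]
kept = retrieve_then_rerank(
    "figure 2 posterior", chunks, overlap_score, density_score, top_k=3, top_n=2
)
print(kept)
```

Setting `top_n` lower than `similarity_top_k` is what lets the second stage prune, rather than only reorder, the retrieved nodes.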


## `LlamaParse` to Answer Questions Related to Parsed PDF

In [None]:
query1 = "What's figure 2 about?"

response1 = recursive_query_engine.query(query1)
print(response1)
print()
print(len(response1.source_nodes))
for source_node in response1.source_nodes:
  print(source_node)

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Retrieval entering 0270345c-6c86-4dc6-8324-725d9f13377f: TextNode
Retrieving from object TextNode with query What's figure 2 about?
Figure 2 illustrates the document posterior probabilities for each token generated in a Jeopardy question about "Hemingway" using the RAG-Token model with five retrieved documents. The posterior probabilities are particularly high for document 1 when generating the phrase "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises." This indicates the relevance of these documents to the tokens being generated in the context of the input "Hemingway."

5
Node ID: e6f02534-2668-411d-8830-153a08bd7464
Text: Document 1: his works are considered classics of American
literature ... His wartime experiences formed the basis for his novel
"A Farewell to Arms" (1929) ...  Document 2: ... artists of the 1920s
"Lost Generation" expatriate community. His debut novel, "The Sun Also
Rises", was published 

In [None]:
query2 = "Which RAG model is the best in this paper?"

response2 = recursive_query_engine.query(query2)
print(response2)
print()
print(len(response2.source_nodes))
for source_node in response2.source_nodes:
  print(source_node)

Retrieval entering 0270345c-6c86-4dc6-8324-725d9f13377f: TextNode
Retrieving from object TextNode with query Which RAG model is the best in this paper?
Retrieval entering 918df946-0def-41dd-a6fc-8b5adb35c442: TextNode
Retrieving from object TextNode with query Which RAG model is the best in this paper?
The RAG-Sequence model generally shows the best performance across various metrics and tasks compared to other RAG model configurations.

5
Node ID: 0270345c-6c86-4dc6-8324-725d9f13377f
Text: The table compares the performance of BART and RAG in terms of
factuality and specificity, showing percentages for scenarios where
one model performs better than the other, both models perform well,
both perform poorly, or no majority opinion., with the following
columns: - Factuality: None - Specificity: None  | | Factuality |
Specificity | |--...
Score: -2.162

Node ID: 7b9a5a80-dbfb-4898-b4cf-4

In [None]:
query3 = "Why RAG-Sequence is better than RAG-Token?"

response3 = recursive_query_engine.query(query3)
print(response3)
print()
print(len(response3.source_nodes))
for source_node in response3.source_nodes:
  print(source_node)

Retrieval entering 8cd54491-d843-41fc-98d7-128307a5907d: TextNode
Retrieving from object TextNode with query Why RAG-Sequence is better than RAG-Token?
RAG-Sequence performs better than RAG-Token in generating more diverse and factually accurate responses for Jeopardy question generation. This is evident from the higher percentage of distinct tri-grams in the generated content, indicating a greater variety in the responses produced by RAG-Sequence compared to RAG-Token. Additionally, RAG-Sequence tends to retrieve more relevant documents, which likely contributes to its enhanced performance in generating more accurate and specific responses.

5
Node ID: a6dd22ac-6384-4e5e-bace-73de7b3a30af
Text: ## Table 4: Human assessments for the Jeopardy Question
Generation Task.  | | Factuality | Specificity |
|-------------|------------|-------------| | BART better | 7.1% |
16.8% | | RAG better | 42.7% | 37.4% | | Both good | 11.7% | 11.8% | |
B