<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/try_pymupdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Try PyMuPDF on Multimodal PDF


#### Observations
* Completely free; unlike LlamaParse, it sets no daily parsing limit.
* It could not parse [the PDF that worked in LlamaParse][3], possibly because PyMuPDF's plain text extraction (`page.get_text()`) does not perform optical character recognition (OCR). This means it cannot directly extract text from scanned documents or image-based PDFs.
  * The same applies to its integration in LlamaIndex.
* One page in the PDF is not necessarily the same as one page in PyMuPDF.
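
A page with no text layer (for example a scanned page) typically comes back from `page.get_text()` as an empty or whitespace-only string. The sketch below is one way to flag such pages before deciding whether a PDF needs OCR. The helper names `pages_without_text` and `find_pages_without_text` and the `min_chars` threshold are hypothetical, not part of PyMuPDF.

```python
def pages_without_text(page_texts, min_chars=1):
    """Return 0-based indexes of pages whose extracted text is (nearly) empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars  # whitespace-only counts as empty
    ]


def find_pages_without_text(pdf_path, min_chars=1):
    """Open a PDF with PyMuPDF and list the pages that yield no plain text."""
    import pymupdf  # imported here so the helper above also works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        return pages_without_text([page.get_text() for page in doc], min_chars)


# Plain-string example: the second and third "pages" have no usable text.
print(pages_without_text(["Abstract ...", "  \n", ""]))
```

If every page of a file is flagged, the PDF is probably image-based and plain text extraction alone will not help.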


#### References
* [The PDF file][1]
* [About PyMuPDF][2]



[1]:https://github.com/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/dataset/sample_paper.pdf
[2]:https://pymupdf.readthedocs.io/en/latest/installation.html
[3]:https://github.com/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/dataset/page78.pdf

In [None]:
!pip install --upgrade pymupdf
!pip install llama-index
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git

## Parse PDF Elements

In [4]:
import pymupdf

doc = pymupdf.open("sample_paper.pdf")

#### Parse PDF Text

In [20]:
with open("output.txt", "wb") as out:  # create a binary output file for the text
    for page in doc:  # iterate over the document pages
        text = page.get_text().encode("utf8")  # plain text, encoded as UTF-8 bytes
        print(text)
        print()
        out.write(text)  # write the text of the page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)

b'Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis\xe2\x80\xa0\xe2\x80\xa1, Ethan Perez\xe2\x8b\x86,\nAleksandra Piktus\xe2\x80\xa0, Fabio Petroni\xe2\x80\xa0, Vladimir Karpukhin\xe2\x80\xa0, Naman Goyal\xe2\x80\xa0, Heinrich K\xc3\xbcttler\xe2\x80\xa0,\nMike Lewis\xe2\x80\xa0, Wen-tau Yih\xe2\x80\xa0, Tim Rockt\xc3\xa4schel\xe2\x80\xa0\xe2\x80\xa1, Sebastian Riedel\xe2\x80\xa0\xe2\x80\xa1, Douwe Kiela\xe2\x80\xa0\n\xe2\x80\xa0Facebook AI Research; \xe2\x80\xa1University College London; \xe2\x8b\x86New York University;\nplewis@fb.com\nAbstract\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when \xef\xac\x81ne-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags behind task-speci\xef\xac\x81c architectures. Additionally, providing prov
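
Since each page is followed by a form feed byte (0x0C) in `output.txt`, the per-page text can be recovered later by splitting on that byte. A minimal sketch, assuming the extracted text itself contains no form feed characters; `split_pages` is a hypothetical helper and the sample pages are made up.

```python
def split_pages(raw: bytes, delimiter: bytes = b"\x0c") -> list[str]:
    """Split form-feed-delimited bytes into one decoded string per page."""
    chunks = raw.split(delimiter)
    # A delimiter is written after every page, so the final chunk is empty.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    return [chunk.decode("utf8") for chunk in chunks]


# In-memory example with two made-up pages, written the same way as output.txt.
sample = "first page\n".encode("utf8") + b"\x0c" + "second page\n".encode("utf8") + b"\x0c"
pages = split_pages(sample)
print(len(pages))
print(pages[1])
```

For the real file, the input would be `pathlib.Path("output.txt").read_bytes()`.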

#### Parse Images

In [21]:
for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    image_list = page.get_images()

    # print the number of images found on the page
    if image_list:
        print(f"Found {len(image_list)} images on page {page_index}")
    else:
        print("No images found on page", page_index)

    for image_index, img in enumerate(image_list, start=1): # enumerate the image list
        xref = img[0] # get the XREF of the image
        pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

        if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

        pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
        pix = None

No images found on page 0
No images found on page 1
No images found on page 2
No images found on page 3
No images found on page 4
No images found on page 5
Found 1 images on page 6
No images found on page 7
No images found on page 8
No images found on page 9
No images found on page 10
No images found on page 11
No images found on page 12
No images found on page 13
No images found on page 14
No images found on page 15
Found 1 images on page 16
No images found on page 17
No images found on page 18
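
The check `pix.n - pix.alpha > 3` in the loop above subtracts the alpha channel from the number of components per pixel, leaving the number of color components: 1 for grayscale, 3 for RGB, 4 for CMYK. PNG cannot store CMYK, hence the conversion to RGB. The sketch below restates that arithmetic on plain integers; the function names are hypothetical and no PyMuPDF objects are involved.

```python
def color_channels(n: int, alpha: int) -> int:
    """Number of color components per pixel, excluding any alpha channel."""
    return n - alpha


def needs_rgb_conversion(n: int, alpha: int) -> bool:
    """True when there are more than 3 color components (e.g. CMYK)."""
    return color_channels(n, alpha) > 3


# (n, alpha) pairs: gray, RGB, RGB with alpha, CMYK, CMYK with alpha
for n, alpha in [(1, 0), (3, 0), (4, 1), (4, 0), (5, 1)]:
    print(n, alpha, needs_rgb_conversion(n, alpha))
```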


## Try Integration with LlamaIndex

In [5]:
from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="sample_paper.pdf")
documents

[Document(id_='afd77ffc-2b29-4fa4-91c6-25c878eb5c3a', embedding=None, metadata={'total_pages': 19, 'file_path': 'sample_paper.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis†‡, Ethan Perez⋆,\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n†Facebook AI Research; ‡University College London; ⋆New York University;\nplewis@fb.com\nAbstract\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags be

## Parsed Text for Q&A with LlamaIndex

In [6]:
import nest_asyncio
nest_asyncio.apply()
import os


os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."  # FILL YOUR OWN LLAMA CLOUD API KEY
os.environ["OPENAI_API_KEY"] = "sk-..."  # FILL YOUR OWN OPENAI API KEY

In [7]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI

# Splits a markdown document into Text Nodes and Index Nodes corresponding to embedded objects (e.g. tables)
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4-turbo"), num_workers=8
)


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes, using the given separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes

In [8]:
page_nodes = get_page_nodes(documents)

print(len(page_nodes))
print()
print(page_nodes[0])
print()
print(page_nodes[-1])
print()
print(page_nodes[0].get_content())

19

Node ID: ffa92878-84eb-44d5-8d35-cbfba9ae4c51
Text: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆, Aleksandra Piktus†, Fabio Petroni†,
Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†, Mike Lewis†,
Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York
University; plewis@fb...

Node ID: 78cb5039-23f2-49b5-b35b-a091724803e8
Text: Table 7: Number of instances in the datasets used. *A hidden
subset of this data is used for evaluation Task Train Development Test
Natural Questions 79169 8758 3611 TriviaQA 78786 8838 11314
WebQuestions 3418 362 2033 CuratedTrec 635 134 635 Jeopardy Question
Generation 97392 13714 26849 MS-MARCO 153726 12468 101093* FEVER-3-way
145450 10000 10...

Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mi
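
The count of 19 matches `total_pages` in the document metadata, which suggests that `PyMuPDFReader` already returns one document per page and that the `"\n---\n"` separator never occurs in the plain text, so each document stays a single chunk. The sketch below shows that splitting behaviour on plain strings; `split_into_chunks` is a hypothetical stand-in for the loop in `get_page_nodes`, and the sample texts are made up.

```python
def split_into_chunks(texts, separator="\n---\n"):
    """Split every text on the separator and collect all chunks in order."""
    chunks = []
    for text in texts:
        chunks.extend(text.split(separator))
    return chunks


# One text per page, no separator inside: the chunk count equals the page count.
per_page_texts = ["page one text", "page two text", "page three text"]
print(len(split_into_chunks(per_page_texts)))

# A single text that holds several pages joined by the separator.
joined_text = ["page one\n---\npage two\n---\npage three"]
print(len(split_into_chunks(joined_text)))
```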

In [9]:
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
print()
print(len(base_nodes))
print(len(objects))

0it [00:00, ?it/s]


30
0





In [10]:
print("********************** sample nodes **********************")
print(base_nodes[0].get_content())
print("********************************************")
print(base_nodes[-1].get_content())
print()

********************** sample nodes **********************
Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric
memory hav

In [11]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker


# Load the parsed contents into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

# Reranker that prunes irrelevant nodes from the context
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
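
With `similarity_top_k=5` and `top_n=5`, the retriever returns 5 nodes and the reranker may keep all 5, so in this configuration reranking mostly changes the order of the context and prunes nothing. The sketch below illustrates the general "re-score, sort, keep the top n" step in plain Python. It is not `FlagEmbeddingReranker`'s implementation, and the node names and scores are made up.

```python
def keep_top_n(scored_nodes, top_n=5):
    """Sort (node, score) pairs by score, highest first, and keep the best top_n."""
    ranked = sorted(scored_nodes, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Made-up reranker scores for three hypothetical nodes.
candidates = [("node_a", -1.4), ("node_b", 2.1), ("node_c", 0.3)]

print(keep_top_n(candidates, top_n=2))  # pruning: fewer nodes than were retrieved
print(keep_top_n(candidates, top_n=5))  # top_n >= number of nodes: reorder only
```

To make the reranker actually prune, `similarity_top_k` would need to be larger than `top_n`.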


In [12]:
query1 = "What's figure 2 about?"

response1 = recursive_query_engine.query(query1)
print(response1)
print()
print(len(response1.source_nodes))
for source_node in response1.source_nodes:
  print(source_node)


Figure 2 illustrates the RAG-Token document posterior probabilities for each generated token when inputting "Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior is high for document 1 when generating "A Farewell to Arms" and for document 2 when generating "The Sun Also Rises".

5
Node ID: 65f91be1-6590-4f3a-9156-98f3e0894bdb
Text: Document 1: his works are considered classics of American
literature ... His wartime experiences formed the basis for his novel
”A Farewell to Arms” (1929) ... Document 2: ... artists of the 1920s
”Lost Generation” expatriate community. His debut novel, ”The Sun Also
Rises”, was published in 1926. BOS ” The Sun Also R ises ” is a novel
by this a...
Score: -1.435

Node ID: 175c6968-27e5-4019-921f-3c3610758fb1
Text: Document 1: his works are considered classics of American
literature ... His wartime experiences formed the basis for his novel
”A Farewell to Arms” (1929) ... Document 2: ... artists of the 1920s
”Lost Generation” expatriat

In [13]:
query2 = "Which RAG model is the best in this paper?"

response2 = recursive_query_engine.query(query2)
print(response2)
print()
print(len(response2.source_nodes))
for source_node in response2.source_nodes:
  print(source_node)

RAG-Token may perform best because it can generate responses that combine content from several documents.

5
Node ID: eae90d5b-f88f-479b-a633-f8b657b82e89
Text: Table 3 shows typical generations from each model. Jeopardy
questions often contain two separate pieces of information, and RAG-
Token may perform best because it can generate responses that combine
content from several documents. Figure 2 shows an example. When
generating “Sun”, the posterior is high for document 2 which mentions
“The Sun Also R...
Score: -0.910

Node ID: 4db0505e-a8f9-49c3-93fb-037edeab7d7b
Text: Broader Impact This work offers several positive societal
beneﬁts over previous work: the fact that it is more strongly grounded
in real factual knowledge (in this case Wikipedia) makes it
“hallucinate” less with generations that are more factual, and offers
more control and interpretability. RAG could be employed in a wide
variety of scenarios ...
Score: -2.454

Node ID: bbfc32f2-62ac-4b14-96df-7e493c682180
Text: Br

In [14]:
query3 = "Why RAG-Sequence is better than RAG-Token?"

response3 = recursive_query_engine.query(query3)
print(response3)
print()
print(len(response3.source_nodes))
for source_node in response3.source_nodes:
  print(source_node)

RAG-Sequence is better than RAG-Token because it outperforms BART on Open MS-MARCO NLG by 2.6 Bleu points and 2.6 Rouge-L points.

5
Node ID: 08b43b7f-17e1-430f-ae00-d7b4b16d7f5c
Text: Table 4: Human assessments for the Jeopardy Question Generation
Task. Factuality Speciﬁcity BART better 7.1% 16.8% RAG better 42.7%
37.4% Both good 11.7% 11.8% Both poor 17.7% 6.9% No majority 20.8%
20.1% Table 5: Ratio of distinct to total tri-grams for generation
tasks. MSMARCO Jeopardy QGen Gold 89.6% 90.0% BART 70.7% 32.4% RAG-
Token 77.8% 46...
Score:  2.118

Node ID: 08435a2c-19b6-4bbb-b118-a3d2ab881f6d
Text: Table 1: Open-Domain QA Test Scores. For TQA, left column uses
the standard test set for Open- Domain QA, right column uses the TQA-
Wiki test set. See Appendix D for further details. Model NQ TQA WQ CT
Closed Book T5-11B [52] 34.5 - /50.1 37.4 - T5-11B+SSM[52] 36.6 -
/60.5 44.7 - Open Book REALM [20] 40.4 - / - 40.7 46.8 DPR [26] 41.5
57.9/ - 41...
Score:  2.020

Node ID: 5d0b596b-7f51-4888-8