# Advanced RAG with LlamaParse

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks through using LlamaParse together with advanced indexing and retrieval techniques in LlamaIndex, applied to Apple's 2021 10-K filing.

This lets us ask sophisticated questions that "naive" parsing and indexing techniques cannot answer well with existing models.

Note: this example uses `llama_index >= 0.10.4`.

In [None]:
!pip install llama-index
!pip install llama-index-core==0.10.6.post1
!pip install llama-index-embeddings-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse

In [None]:
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O apple_2021_10k.pdf

Next, set the API keys for OpenAI and LlamaParse (LlamaCloud).

In [None]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-..."
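
Hard-coding keys in a notebook cell makes them easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helpers below check which keys are missing and prompt for them without echoing the value. `missing_keys` and `prompt_for_missing` are hypothetical names introduced here for illustration; they are not part of either SDK.

```python
import os
from getpass import getpass

# The two environment variables this notebook relies on.
REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def prompt_for_missing(env=os.environ):
    """Ask for each missing key interactively, without echoing it."""
    for name in missing_keys(env):
        env[name] = getpass(f"{name}: ")
```

Calling `prompt_for_missing()` once, in place of the two assignments, would leave keys that are already exported in the shell untouched.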

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

## Using the new `LlamaParse` PDF reader for PDF parsing

We also compare two retrieval/query engine strategies:
1. Use the raw Markdown text as nodes to build the index, and apply a simple query engine to generate the results;
2. Use `MarkdownElementNodeParser` to parse the Markdown output from `LlamaParse`, and build a recursive retriever query engine for generation.

In [None]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("./apple_2021_10k.pdf")

Started parsing the file under job_id cac11eca-71db-4dab-b72b-c67d31e551f3


In [None]:
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core import VectorStoreIndex


def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes on the page separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes
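
To see what the split produces, here is the same idea on plain strings, with no LlamaIndex dependency. The sample text is made up, and the sketch assumes pages are separated by `\n---\n`, the default separator of `get_page_nodes` above; `split_pages` is a hypothetical helper name.

```python
def split_pages(markdown_text, separator="\n---\n"):
    """Split parsed Markdown into per-page strings on the page separator."""
    return markdown_text.split(separator)


# Two made-up pages; the table divider "|---|---|" does not match the
# separator because it is not a bare "---" line.
sample = "# Page one\nSome text\n---\n# Page two\n| a | b |\n|---|---|\n| 1 | 2 |"
pages = split_pages(sample)

for number, page in enumerate(pages, start=1):
    print(f"page {number}: {len(page)} characters")
```

One caveat of splitting on a bare `---` line is that a horizontal rule inside a page would also be treated as a page break.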

In [None]:
page_nodes = get_page_nodes(documents)

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [None]:
objects[0].get_content()

"This table provides information about a company's state of incorporation or organization and its corresponding I.R.S. Employer Identification Number.,\nwith the following table title:\nCompany Incorporation Information,\nwith the following columns:\n- California: None\n- 94-2404110: None\n"

In [None]:
# dump both indexed tables and page text into the vector index
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

In [None]:
print(page_nodes[31].get_content())

# Apple Inc.

**CONSOLIDATED STATEMENTS OF OPERATIONS (In millions, except number of shares which are reflected in thousands and per share amounts)**
| |September 25, 2021|September 26, 2020|September 28, 2019|
|---|---|---|---|
|Net sales:|$297,392|$220,747|$213,883|
|Products| | | |
|Services|$68,425|$53,768|$46,291|
|Total net sales|$365,817|$274,515|$260,174|
|Cost of sales:| | | |
|Products|$192,266|$151,286|$144,996|
|Services|$20,715|$18,273|$16,786|
|Total cost of sales|$212,981|$169,559|$161,782|
|Gross margin|$152,836|$104,956|$98,392|
|Operating expenses:| | | |
|Research and development|$21,914|$18,752|$16,217|
|Selling, general and administrative|$21,973|$19,916|$18,245|
|Total operating expenses|$43,887|$38,668|$34,462|
|Operating income|$108,949|$66,288|$63,930|
|Other income/(expense), net|$258|$803|$1,807|
|Income before provision for income taxes|$109,207|$67,091|$65,737|
|Provision for income taxes|$14,527|$9,680|$10,481|
|Net income|$94,680|$57,411|$55,256|
|Earning

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)
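
The query engine above first retrieves `similarity_top_k` candidates and then passes them to the reranker, which keeps `top_n` of them. A toy sketch of that two-stage pattern follows. Word-overlap scores stand in for both the embedding model and `bge-reranker-large`; both scoring functions are illustrative assumptions, not what the libraries compute.

```python
def overlap_score(query, text):
    """Cheap first-stage score: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


def phrase_score(query, text):
    """Stand-in second-stage score: overlap plus a bonus for an exact phrase match."""
    bonus = 1.0 if query.lower() in text.lower() else 0.0
    return overlap_score(query, text) + bonus


def retrieve_then_rerank(query, corpus, similarity_top_k=5, top_n=2):
    # Stage 1: keep the top-k candidates under the cheap score.
    candidates = sorted(corpus, key=lambda t: overlap_score(query, t), reverse=True)
    candidates = candidates[:similarity_top_k]
    # Stage 2: re-score only those candidates and keep the best top_n.
    reranked = sorted(candidates, key=lambda t: phrase_score(query, t), reverse=True)
    return reranked[:top_n]


corpus = [
    "net sales of products and services",
    "total net sales by segment",
    "deferred tax assets and liabilities",
    "sales net of returns",
]
top = retrieve_then_rerank("net sales", corpus, similarity_top_k=3, top_n=1)
print(top)
```

When `top_n` equals `similarity_top_k`, as in the cell above, the second stage in this sketch can only reorder the candidates; it never drops any.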

In [None]:
print(len(nodes))

233


## Setup Baseline

For comparison, we set up a naive RAG pipeline with default parsing and standard chunking, indexing, and retrieval.

In [None]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=["apple_2021_10k.pdf"])
base_docs = reader.load_data()
raw_index = VectorStoreIndex.from_documents(base_docs)
raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker]
)
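
"Standard chunking" here means cutting the extracted text into fixed-size pieces regardless of where tables begin and end. A rough character-based sketch of that idea follows. The sizes are arbitrary and `chunk_text` is a hypothetical helper; LlamaIndex's default splitter works on tokens and sentences rather than raw characters, so treat this as an illustration only.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Cut text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# A flattened table row, similar to what plain PDF text extraction yields.
row = "Deferred (7,176) (3,619) (2,939) Total 1,081 2,687 3,445"
chunks = chunk_text(row, chunk_size=24, overlap=4)

for chunk in chunks:
    print(repr(chunk))
```

The later chunks hold numbers with no row label or year attached, which is one way a naive pipeline can end up pairing a value with the wrong column.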

## Using the new `LlamaParse` for PDF parsing and retrieving tables with two different methods
We compare the base query engine with the recursive query engine on questions about tables.

### Table Query Task: Queries for Table Question Answering

In [None]:
query = "Purchases of marketable securities in 2020"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The purchases of marketable securities in 2020 amounted to $163.4 billion.
Retrieval entering 59368b87-e602-4bd1-88a7-7526fd6ab83f: TextNode
Retrieving from object TextNode with query Purchases of marketable securities in 2020
Retrieval entering dfd97f47-eb4d-4bab-8a22-9bbbc0096a4b: TextNode
Retrieving from object TextNode with query Purchases of marketable securities in 2020

***********New LlamaParse+ Recursive Retriever Query Engine***********
$114,938


In [None]:
print(response_2.source_nodes[2].get_content())

This table provides information on hedged assets and liabilities for the years 2021 and 2020, including current and non-current marketable securities and term debt.,
with the following table title:
Hedged Assets and Liabilities Summary,
with the following columns:
- 2021: None
- 2020: None

| |2021|2020|
|---|---|---|
|Hedged assets/(liabilities):| | |
|Current and non-current marketable securities|$15,954|$16,270|
|Current and non-current term debt|$(17,857)|$(21,033)|



In [None]:
query = "effective interest rates of all debt issuances in 2021"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
0.03%, 0.75%, 1.43%
Retrieval entering a5afa785-217f-4e72-87cf-15da11632ec0: TextNode
Retrieving from object TextNode with query effective interest rates of all debt issuances in 2021

***********New LlamaParse+ Recursive Retriever Query Engine***********
0.48% – 0.63%, 0.03% – 4.78%, 0.75% – 2.81%, 1.43% – 2.86%


In [None]:
print(response_1.source_nodes[0].get_content())

Term Debt
As of September 25, 2021 , the Company had outstanding floating- and fixed-rate notes with varying maturities for an aggregate 
principal amount of $118.1 billion  (collectively the “Notes”). The Notes are senior unsecured obligations and interest is payable in 
arrears. The following table provides a summary of the Company’s term debt as of September 25, 2021  and September 26, 
2020 :
Maturities
(calendar year)2021 2020
Amount
(in millions)Effective
Interest RateAmount
(in millions)Effective
Interest Rate
2013 – 2020 debt issuances:
Floating-rate notes  2022 $ 1,750 0.48%  – 0.63% $ 2,250 0.60%  – 1.39%
Fixed-rate 0.000%  – 4.650%  notes 2022  – 2060  95,813 0.03%  – 4.78%  103,828 0.03%  – 4.78%
Second quarter 2021 debt issuance:
Fixed-rate 0.700%  – 2.800%  notes 2026  – 2061  14,000 0.75%  – 2.81%  —  — %
Fourth quarter 2021 debt issuance:
Fixed-rate 1.400%  – 2.850%  notes 2028  – 2061  6,500 1.43%  – 2.86%  —  — %
Total term debt  118,063  106,078 
Unamortized premium/

In [None]:
query = "Impacts of the U.S. Tax Cuts and Jobs Act of 2017 on income taxes in 2020"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
The U.S. Tax Cuts and Jobs Act of 2017 had an impact on income taxes in 2020, as evidenced by a decrease in the provision for income taxes compared to the prior year.
Retrieval entering b9416f35-ebf1-45d6-9a29-b59e435ab42d: TextNode
Retrieving from object TextNode with query Impacts of the U.S. Tax Cuts and Jobs Act of 2017 on income taxes in 2020
Retrieval entering 8d8d5733-ff30-4535-9376-7f761b5900ea: TextNode
Retrieving from object TextNode with query Impacts of the U.S. Tax Cuts and Jobs Act of 2017 on income taxes in 2020
Retrieval entering 82f301e5-199a-4aa2-bbdf-ef97898c0326: TextNode
Retrieving from object TextNode with query Impacts of the U.S. Tax Cuts and Jobs Act of 2017 on income taxes in 2020
Retrieval entering 86f666b4-254b-487f-9870-8ee09aef07a9: TextNod

In [None]:
print(response_1.source_nodes[0].get_content())

Other Income/(Expense), Net
The following table shows the detail of OI&E for 2021 , 2020  and 2019  (in millions):
2021 2020 2019
Interest and dividend income $ 2,843 $ 3,763 $ 4,961 
Interest expense  (2,645)  (2,873)  (3,576) 
Other income/(expense), net  60  (87)  422 
Total other income/(expense), net $ 258 $ 803 $ 1,807 
Note 5 – Income Taxe s
Provision for Income Taxes and Effective  Tax Rat e
The provision for income taxes for 2021 , 2020  and 2019 , consisted of the following (in millions):
2021 2020 2019
Federal:
Current $ 8,257 $ 6,306 $ 6,384 
Deferred  (7,176)  (3,619)  (2,939) 
Total  1,081  2,687  3,445 
State:
Current  1,620  455  475 
Deferred  (338)  21  (67) 
Total  1,282  476  408 
Foreign:
Current  9,424  3,134  3,962 
Deferred  2,740  3,383  2,666 
Total  12,164  6,517  6,628 
Provision for income taxes $ 14,527 $ 9,680 $ 10,481 
The foreign provision for income taxes is based on foreign pretax earnings of $68.7 billion , $38.1 billion  and $44.3 billion  in 2021 ,

In [None]:
query = "federal deferred tax in 2019-2021"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
$3,619 million in 2019, $7,176 million in 2020, and $1,081 million in 2021
Retrieval entering 12b1355a-f9e6-4b08-a19a-3ffc00dc5b9f: TextNode
Retrieving from object TextNode with query federal deferred tax in 2019-2021
Retrieval entering 82f301e5-199a-4aa2-bbdf-ef97898c0326: TextNode
Retrieving from object TextNode with query federal deferred tax in 2019-2021
Retrieval entering 8d8d5733-ff30-4535-9376-7f761b5900ea: TextNode
Retrieving from object TextNode with query federal deferred tax in 2019-2021

***********New LlamaParse+ Recursive Retriever Query Engine***********
$2,939, $3,619, $7,176


In [None]:
query = "give me the deferred state income tax in 2019-2021 (include +/-)"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
State deferred income tax for 2019: $454 million
State deferred income tax for 2020: $21 million
State deferred income tax for 2021: -$338 million
Retrieval entering 12b1355a-f9e6-4b08-a19a-3ffc00dc5b9f: TextNode
Retrieving from object TextNode with query give me the deferred state income tax in 2019-2021 (include +/-)
Retrieval entering 8d8d5733-ff30-4535-9376-7f761b5900ea: TextNode
Retrieving from object TextNode with query give me the deferred state income tax in 2019-2021 (include +/-)

***********New LlamaParse+ Recursive Retriever Query Engine***********
Deferred state income tax for the years 2019-2021:
- 2019: ($67) million
- 2020: $21 million
- 2021: ($338) million


In [None]:
print(response_2.source_nodes[0].get_content())

Summary of income tax provisions for Federal, State, and Foreign entities over the years 2019, 2020, and 2021.,
with the following table title:
Income Tax Provisions by Entity and Year,
with the following columns:
- Entity: The type of entity (Federal, State, Foreign)
- 2019: Income tax provisions for the year 2019
- 2020: Income tax provisions for the year 2020
- 2021: Income tax provisions for the year 2021

| |2021|2020|2019|
|---|---|---|---|
|Federal:| | | |
|Current|$8,257|$6,306|$6,384|
|Deferred|(7,176)|(3,619)|(2,939)|
|Total|1,081|2,687|3,445|
|State:| | | |
|Current|1,620|455|475|
|Deferred|(338)|21|(67)|
|Total|1,282|476|408|
|Foreign:| | | |
|Current|9,424|3,134|3,962|
|Deferred|2,740|3,383|2,666|
|Total|12,164|6,517|6,628|
|Provision for income taxes|$14,527|$9,680|$10,481|
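
The filing uses accounting notation, where parentheses mark negative amounts, and its columns run from 2021 back to 2019. As a small sketch, assuming a table shaped like the one above, the helpers below turn those cells into signed integers so an answer can be checked against the source. `parse_amount` and `parse_markdown_table` are hypothetical names introduced here, and the sample is an excerpt of the rows shown above.

```python
def parse_amount(cell):
    """Convert '$8,257' to 8257 and '(338)' to -338; return None for empty cells."""
    cleaned = cell.strip().replace("$", "").replace(",", "")
    if not cleaned:
        return None
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -int(cleaned[1:-1])
    return int(cleaned)


def parse_markdown_table(markdown):
    """Return (header, rows) where each row is (label, [amounts])."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0][1:-1].split("|")]
    rows = []
    for line in lines[2:]:  # skip the header and the |---| divider
        cells = line[1:-1].split("|")
        rows.append((cells[0].strip(), [parse_amount(c) for c in cells[1:]]))
    return header, rows


table = """
| |2021|2020|2019|
|---|---|---|---|
|Federal:| | | |
|Current|$8,257|$6,306|$6,384|
|Deferred|(7,176)|(3,619)|(2,939)|
|State:| | | |
|Current|1,620|455|475|
|Deferred|(338)|21|(67)|
"""

header, rows = parse_markdown_table(table)
years = header[1:]
section = None
by_key = {}
for label, amounts in rows:
    if all(a is None for a in amounts):
        section = label.rstrip(":")  # a row with no numbers starts a new group
        continue
    by_key[(section, label)] = dict(zip(years, amounts))

print(by_key[("State", "Deferred")])
```

Read this way, the state deferred figures are -67, 21 and -338 (in millions) for 2019, 2020 and 2021, which agrees with the recursive engine's answer above.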



In [None]:
query = "current state taxes per year in 2019-2021 (include +/-)"

response_1 = raw_query_engine.query(query)
print("\n***********Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********Basic Query Engine***********
$1,620 million in 2019, $455 million in 2020, $475 million in 2021
Retrieval entering 82f301e5-199a-4aa2-bbdf-ef97898c0326: TextNode
Retrieving from object TextNode with query current state taxes per year in 2019-2021 (include +/-)
Retrieval entering 8d8d5733-ff30-4535-9376-7f761b5900ea: TextNode
Retrieving from object TextNode with query current state taxes per year in 2019-2021 (include +/-)
Retrieval entering b9416f35-ebf1-45d6-9a29-b59e435ab42d: TextNode
Retrieving from object TextNode with query current state taxes per year in 2019-2021 (include +/-)
Retrieval entering a029e464-575f-4dd6-afad-7cc0bbc5dbf9: TextNode
Retrieving from object TextNode with query current state taxes per year in 2019-2021 (include +/-)

***********New LlamaPa