<a href="https://colab.research.google.com/github/jaideep11061982/GenAINotebooks/blob/main/demo_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced RAG with LlamaParse

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows you how to use LlamaParse with our advanced markdown ingestion and recursive retrieval algorithms to model tables/text within a document hierarchically. This lets you ask questions over both tables and text.

Note: this example requires `llama_index >= 0.10.4`.

In [None]:
%pip install llama-index
%pip install llama-index-core
%pip install llama-index-embeddings-openai
%pip install llama-index-postprocessor-flag-embedding-reranker
%pip install git+https://github.com/FlagOpen/FlagEmbedding.git
%pip install llama-parse

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'

Set the LlamaCloud and OpenAI API keys, then configure the embedding model and LLM.

In [None]:
# llama-parse is async-first; running its async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-"
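
The comment about `nest_asyncio` can be made concrete with a small standard-library sketch. It is meant to run as a plain script, not inside the notebook, and `parse_stub` is a hypothetical stand-in for an async parsing call, not part of `llama_parse`. It shows that plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation a notebook cell is in.

```python
import asyncio


async def parse_stub() -> str:
    # Hypothetical stand-in for an async parsing call.
    return "parsed"


async def notebook_cell() -> str:
    # This coroutine runs inside an event loop, like code in a Jupyter cell.
    coro = parse_stub()
    try:
        # Starting a nested loop from inside a running loop is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(notebook_cell())
print(message)
```

With `nest_asyncio.apply()` in effect, as in the cell above, the nested call should be allowed instead of raising.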

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")

Settings.llm = llm
Settings.embed_model = embed_model

## Using the new `LlamaParse` PDF reader for PDF parsing

We also compare two retrieval/query engine strategies:
1. Use the raw Markdown text as nodes to build an index, and apply a simple query engine to generate the results.
2. Use `MarkdownElementNodeParser` to parse the Markdown output of `LlamaParse`, and build a recursive retriever query engine for generation.
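
To give some intuition for the second strategy, here is a toy sketch of recursive retrieval in plain Python. It is not how LlamaIndex implements it: the `Node` class, the `ref_id` field, the word-overlap scoring, and the sample data are all made up for illustration. The idea it shows is that a short summary node is what gets matched against the query, and the match is then followed to the full table it points to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    text: str
    # Hypothetical field: id of a larger object (such as a full table)
    # that this node summarizes.
    ref_id: Optional[str] = None


def overlap(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_retrieve(query: str, index_nodes: List[Node], objects: Dict[str, str]) -> str:
    # Step 1: pick the best-matching node (plain text or table summary).
    best = max(index_nodes, key=lambda n: overlap(query, n.text))
    # Step 2: if it references a larger object, return that object instead.
    if best.ref_id is not None:
        return objects[best.ref_id]
    return best.text


# Made-up sample data, unrelated to the Uber filing.
objects = {"table_1": "|Metric|Q1|Q2|\n|widgets sold|10|12|"}
index_nodes = [
    Node("text_1", "The company sells widgets in two regions."),
    Node("summary_1", "Table summarizing widgets sold per quarter.", ref_id="table_1"),
]

result = recursive_retrieve("how many widgets sold per quarter", index_nodes, objects)
print(result)
```

The first strategy corresponds to skipping step 2 and returning the matched text directly.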

In [None]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("./uber_10q_march_2022.pdf")

Started parsing the file under job_id edbcecf3-5379-40de-9c52-0d97985dccf5


In [None]:
print(documents[0].text[:1000] + "...")

# Document

# UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549

## FORM 10-Q

(Mark One)

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the quarterly period
ended March 31, 2022 OR ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from_____ to _____ Commission File Number: 001-38902

UBER TECHNOLOGIES, INC. (Exact name of registrant as specified in its charter) Not Applicable (Former name, former
address and former fiscal year, if changed since last report)

Delaware 45-2647441 (State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification
No.)

1515 3rd Street San Francisco, California 94158 (Address of principal executive offices, including zip code) (415)
612-8582 (Registrant’s telephone number, including area code)

Securities registered pursuant to Section 12(b) of the Act:

|Title of each class|Trading Symbol(s)|

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

Embeddings have been explicitly disabled. Using MockEmbedding.


80it [00:00, 77744.28it/s]
100%|██████████| 80/80 [00:21<00:00,  3.66it/s]


In [None]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [None]:
recursive_index = VectorStoreIndex(nodes=base_nodes + objects)
raw_index = VectorStoreIndex.from_documents(documents)

In [None]:
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker], verbose=True
)

raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker]
)
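
Both query engines above retrieve 15 candidates (`similarity_top_k=15`) and let the reranker keep the best 5 (`top_n=5`). The following is a minimal sketch of that two-stage pattern in plain Python. The two scoring functions are toy stand-ins chosen for this example, not the embedding model or the `BAAI/bge-reranker-large` model, and the corpus is made up.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    similarity_top_k: int = 15,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: a cheap score narrows the corpus to similarity_top_k candidates.
    candidates = sorted(corpus, key=lambda t: cheap_score(query, t), reverse=True)
    candidates = candidates[:similarity_top_k]
    # Stage 2: a more careful score reorders the candidates; keep top_n.
    scored = [(t, precise_score(query, t)) for t in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def word_overlap(query: str, text: str) -> float:
    # Toy first-stage score: count of shared words.
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def jaccard(query: str, text: str) -> float:
    # Toy second-stage score: shared words divided by all distinct words.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


corpus = [
    "cash paid for income taxes",
    "cash flows from investing activities and cash paid for many other unrelated things",
    "net loss attributable to Uber",
    "free cash flow",
]
top = retrieve_then_rerank(
    "cash paid for income taxes", corpus, word_overlap, jaccard,
    similarity_top_k=3, top_n=2,
)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

A larger `similarity_top_k` gives the second stage more candidates to choose from at the cost of more reranking work.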

In [None]:
print(len(nodes))

303
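
Beyond the total count, it can help to see how many nodes of each kind the parser produced. The helper below is a small sketch, with the name `count_node_types` being an assumption of this example. It only looks at Python class names, so it is demonstrated here on made-up placeholder classes, not real LlamaIndex nodes.

```python
from collections import Counter
from typing import Iterable


def count_node_types(nodes: Iterable[object]) -> Counter:
    # Tally nodes by the name of their Python class.
    return Counter(type(node).__name__ for node in nodes)


# Placeholder classes for the demo; not LlamaIndex types.
class FakeTextNode:
    pass


class FakeIndexNode:
    pass


sample = [FakeTextNode(), FakeTextNode(), FakeIndexNode()]
counts = count_node_types(sample)
print(counts)
```

In the notebook, a call such as `count_node_types(base_nodes + objects)` would report the class names actually used by the installed `llama_index` version.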


## Using the new `LlamaParse` as the PDF parser and answering table questions with two different methods

We compare the base query engine with the recursive query engine on questions about tables.
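
Each comparison sends the same query to both engines and prints the answers under a banner. A small helper can capture that pattern. This is a sketch: `compare_engines` and `StubEngine` are names invented here, and the helper assumes only that each engine has a `query` method whose result can be converted with `str`.

```python
from typing import Dict, Protocol


class QueryEngine(Protocol):
    def query(self, query_str: str) -> object: ...


def compare_engines(query: str, engines: Dict[str, QueryEngine]) -> Dict[str, str]:
    # Run the same query through every engine and print labeled answers.
    answers: Dict[str, str] = {}
    for label, engine in engines.items():
        answer = str(engine.query(query))
        print(f"\n***********{label}***********")
        print(answer)
        answers[label] = answer
    return answers


class StubEngine:
    # Stand-in engine for the demo; echoes the query with a fixed prefix.
    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def query(self, query_str: str) -> str:
        return f"{self.prefix}: {query_str}"


answers = compare_engines(
    "example question",
    {"Engine A": StubEngine("A"), "Engine B": StubEngine("B")},
)
```

In the notebook, the dictionary could map the two banner labels to `raw_query_engine` and `recursive_query_engine`.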

### Table Query Task: Queries for Table Question Answering

In [None]:
query = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"

response_1 = raw_query_engine.query(query)
print("\n***********New LlamaParse+ Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********New LlamaParse+ Basic Query Engine***********
Cash paid for income taxes, net of refunds, is not explicitly provided in the context information.
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_44_table: TextNode
Retrieving from object TextNode with query how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_42_table: TextNode
Retrieving from object TextNode with query how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_40_table: TextNode
Retrieving from object TextNode with query how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information

![image.png](attachment:image.png)

In [None]:
query = "what is the change of free cash flow and what is the rate from the financial and operational highlights?"

response_1 = raw_query_engine.query(query)
print("\n***********New LlamaParse+ Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********New LlamaParse+ Basic Query Engine***********
The change in free cash flow from the financial and operational highlights is a decrease from $(682) million in 2021 to $(47) million in 2022. This represents a significant improvement in free cash flow performance from one period to the next.
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_320_table: TextNode
Retrieving from object TextNode with query what is the change of free cash flow and what is the rate from the financial and operational highlights?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_38_table: TextNode
Retrieving from object TextNode with query what is the change of free cash flow and what is the rate from the financial and operational highlights?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_44_table: TextNode
Retrieving from

![image.png](attachment:image.png)

In [None]:
query = "what is the net loss value attributable to Uber compared to last year?"

response_1 = raw_query_engine.query(query)
print("\n***********New LlamaParse+ Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********New LlamaParse+ Basic Query Engine***********
The net loss value attributable to Uber for the current period is $5.9 billion, which is an increase compared to the net loss of $108 million in the same period last year.
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_22_table: TextNode
Retrieving from object TextNode with query what is the net loss value attributable to Uber compared to last year?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_316_table: TextNode
Retrieving from object TextNode with query what is the net loss value attributable to Uber compared to last year?
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_230_table: TextNode
Retrieving from object TextNode with query what is the net loss value attributable to Uber compared to last year?
Retrieval ente

![image.png](attachment:image.png)

In [None]:
query = "What were cash flows like from investing activities?"

response_1 = raw_query_engine.query(query)
print("\n***********New LlamaParse+ Basic Query Engine***********")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\n***********New LlamaParse+ Recursive Retriever Query Engine***********")
print(response_2)


***********New LlamaParse+ Basic Query Engine***********
Cash flows from investing activities were as follows:
- For the three months ended March 31, 2022, net cash used in investing activities was $135 million, primarily driven by $62 million in purchases of property and equipment and $59 million in acquisition of business, net of cash acquired.
- For the three months ended March 31, 2021, net cash used in investing activities was $250 million, mainly consisting of $803 million in purchases of non-marketable equity securities, $336 million in purchases of marketable securities, and $216 million in purchases of a note receivable, partially offset by proceeds from maturities and sales of marketable securities of $696 million and $500 million in proceeds from the sale of non-marketable equity securities.
Retrieval entering id_b656577b-91de-47ca-981e-8b1d63e20c20_44_table: TextNode
Retrieving from object TextNode with query What were cash f

![image.png](attachment:image.png)