## LangChain & LlamaIndex RAG comparison

This notebook compares LangChain and LlamaIndex to understand which approach is best at extracting tables and text from a PDF.


Here we cover:

1. LangChain RAG
2. LlamaIndex RAG
3. LangChain with LlamaParse
4. LlamaIndex with LlamaParse


Running the same questions through these four methods gives an idea of which one works best for table extraction on the data used here.


In [None]:
# install dependencies
%pip install llama-index llama-index-core llama-index-embeddings-openai llama-parse
%pip install llama-index-postprocessor-flag-embedding-reranker
%pip install git+https://github.com/FlagOpen/FlagEmbedding.git
%pip install llama-index-vector-stores-lancedb
%pip install --upgrade --quiet  langchain langchain-community langchainhub langchain-openai langchain-chroma bs4 lancedb
%pip  install unstructured


In [None]:
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import os
import nest_asyncio

nest_asyncio.apply()

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
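
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming a hypothetical helper `ensure_api_key` (not part of either library) that reuses a key already present in the environment and otherwise prompts for it without echoing:

```python
import os
from getpass import getpass


def ensure_api_key(name: str) -> str:
    """Return the key stored in environment variable `name`, prompting once if it is missing."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it does not end up in the notebook output
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Usage in the notebook (prompts only when the variable is not already set):
# ensure_api_key("LLAMA_CLOUD_API_KEY")
# ensure_api_key("OPENAI_API_KEY")
```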

### Download the PDF (contains both tables & text)

In [None]:
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10q/uber_10q_march_2022.pdf' -O './uber_10q_march_2022.pdf'

# 1. LangChain with Q&A on PDF

In [None]:
# import modules

from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import LanceDB
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

loader = PyPDFLoader("/content/uber_10q_march_2022.pdf")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# LanceDB as retriever
vectorstore = LanceDB.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
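
To get a feel for what `chunk_size=1000` and `chunk_overlap=200` mean, here is a simplified sketch. It is not the `RecursiveCharacterTextSplitter` algorithm, which prefers to cut at separators such as paragraph breaks; the hypothetical `sliding_window_chunks` below ignores separators and only shows how consecutive chunks share their boundary text:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# 2,500 characters of filler text stand in for a page of the PDF
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the shared 200-character overlap
```

The overlap is what gives a sentence or table row that straddles a boundary a chance to appear whole in at least one chunk.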

In [None]:
# retriever chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
# prompt
qa_langchain_query1 = (
    " what is the net loss value attributable to Uber compared to last year?"
)
rag_chain.invoke(qa_langchain_query1)

'The net loss value attributable to Uber Technologies, Inc. for the period was $5.9 billion, compared to $108 million in the same period the previous year. This represents a significant increase in net loss year-over-year.'

In [None]:
# prompt
qa_langchain_query2 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
rag_chain.invoke(qa_langchain_query2)

"I don't know."

In [None]:
# prompt
qa_langchain_query3 = "give me detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022"
rag_chain.invoke(qa_langchain_query3)

"I don't have detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022."

For query 2 and query 3, we didn't get a useful answer.

**LET'S TRY IT WITH LLAMAINDEX**

# 2. LlamaIndex with Q&A on PDF

In [None]:
# import modules
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)
from llama_index.vector_stores.lancedb import LanceDBVectorStore

In [None]:
# data loading in LlamaIndex
# (assumes uber_10q_march_2022.pdf has been placed in /content/data_pdf/)
reader = SimpleDirectoryReader(input_dir="/content/data_pdf/")

documents_pdf_loader = reader.load_data()
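
The reader above looks in `/content/data_pdf/`, while the download cell saved the PDF to the working directory. A minimal sketch of a staging step to run before the reader, assuming a hypothetical helper `stage_pdf` and the paths already used in this notebook:

```python
import shutil
from pathlib import Path


def stage_pdf(pdf_path: str, target_dir: str) -> Path:
    """Copy `pdf_path` into `target_dir` (created if needed) and return the new path."""
    source = Path(pdf_path)
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; download the PDF first")
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target / source.name))


# Usage with the paths from this notebook:
# stage_pdf("/content/uber_10q_march_2022.pdf", "/content/data_pdf")
```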

In [None]:
# LanceDB as vector store
vector_store_pdf = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")

In [None]:
storage_context_pdf = StorageContext.from_defaults(vector_store=vector_store_pdf)
lance_index_pdf = VectorStoreIndex.from_documents(
    documents_pdf_loader, storage_context=storage_context_pdf
)

In [None]:
# reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

In [None]:
# index
lance_index_query_pdf = lance_index_pdf.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)

# query
qa_lama_query1 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
output1 = lance_index_query_pdf.query(qa_lama_query1)
print(output1.response)
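
The query engine above works in two stages: it first pulls `similarity_top_k=10` candidates from the vector store, then the reranker keeps the best `top_n=5`. The sketch below shows only that control flow. The two scoring functions are toy stand-ins (assumptions for illustration) for embedding similarity and for the `BAAI/bge-reranker-large` model, which are far more capable:

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    passages: list[str],
    first_stage: Callable[[str, str], float],
    second_stage: Callable[[str, str], float],
    similarity_top_k: int = 10,
    top_n: int = 5,
) -> list[str]:
    """Keep the best `similarity_top_k` passages by `first_stage`, then the best `top_n` of those by `second_stage`."""
    candidates = sorted(passages, key=lambda p: first_stage(query, p), reverse=True)
    candidates = candidates[:similarity_top_k]
    return sorted(candidates, key=lambda p: second_stage(query, p), reverse=True)[:top_n]


def word_overlap(query: str, passage: str) -> float:
    """Toy first-stage score: number of distinct query words found in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def overlap_density(query: str, passage: str) -> float:
    """Toy second-stage score: shared words divided by passage length, favouring focused passages."""
    words = passage.lower().split()
    return word_overlap(query, passage) / len(words) if words else 0.0


passages = [
    "net loss attributable to Uber",
    "cash paid for income taxes net of refunds",
    "the company reported a net loss in the quarter and other unrelated items",
    "intangible assets net",
]
print(retrieve_then_rerank("net loss", passages, word_overlap, overlap_density,
                           similarity_top_k=3, top_n=2))
```

A passage dropped by the first stage never reaches the reranker, which is why `similarity_top_k` is set larger than `top_n`.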

In [None]:
# query
qa_lama_query2 = (
    " what is the net loss value attributable to Uber compared to last year?"
)
output2 = lance_index_query_pdf.query(qa_lama_query2)
print(output2.response)

The net loss attributable to Uber Technologies, Inc. was $5.9 billion in the current period, compared to a net loss of $108 million in the same period last year.


In [None]:
# query
qa_lama_query3 = "give me detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022"
output3 = lance_index_query_pdf.query(qa_lama_query3)
print(output3.response)

The detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022 are as follows:

**As of December 31, 2021:**
- Consumer, Merchant and other relationships: $1,574 million
- Developed technology: $653 million
- Trade names and trademarks: $175 million
- Patents: $8 million
- Other: $2 million
- Total Intangible assets: $2,412 million

**As of March 31, 2022:**
- Consumer, Merchant and other relationships: $1,494 million
- Developed technology: $599 million
- Trade names and trademarks: $167 million
- Patents: $7 million
- Other: $2 million
- Total Intangible assets: $2,269 million


In [None]:
# query
qa_lama_query4 = "what is Adjusted EBITDA 2021 vs 2022? what is interest expense"
output4 = lance_index_query_pdf.query(qa_lama_query4)
print(output4.response)

Adjusted EBITDA for 2021 was a loss of $359 million, while for 2022 it improved to $168 million. Interest expense for the period increased from $115 million in 2021 to $129 million in 2022.


# 3. LlamaParse with LangChain on PDF

Here we simply save all of the LlamaParse output to a .md file and run Q&A on that file. There are better ways to combine LlamaParse with LangChain, **but let's do this experiment**.

In [None]:
import os
from llama_parse import LlamaParse

# Ensure the data folder exists
os.makedirs("data", exist_ok=True)

# Load data using LlamaParse
documents_LlamaParse = LlamaParse(result_type="markdown").load_data(
    "/content/uber_10q_march_2022.pdf"
)

# Write the parsed markdown to a file ("w" overwrites, so re-running the cell does not duplicate content)
with open("data/output.md", "w") as f:
    for doc in documents_LlamaParse:
        f.write(doc.text + "\n")

Started parsing the file under job_id 6aef2258-0e5c-4796-8ea1-f82e817cb542
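
One thing to watch when the saved markdown is later split by character count is that a chunk boundary can land inside a table. A minimal sketch of a table-preserving alternative, assuming (this is an assumption about the parsed output, not something checked here) that tables are separated from surrounding text by blank lines. The helper `split_markdown_keep_tables` is hypothetical and not part of LangChain:

```python
def split_markdown_keep_tables(markdown: str, chunk_size: int) -> list[str]:
    """Group blank-line-separated blocks into chunks of roughly `chunk_size` characters.

    A block (paragraph or table) is never cut, so a block larger than
    `chunk_size` becomes one oversized chunk.
    """
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
    chunks, current = [], ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the current chunk before it overflows
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\n|a|b|\n|---|---|\n|1|2|\n\nClosing paragraph."
for chunk in split_markdown_keep_tables(sample, chunk_size=30):
    print(repr(chunk))
```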


In [None]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import DirectoryLoader

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

loader = DirectoryLoader("/content/data", glob="**/*.md", show_progress=True)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
docs = text_splitter.split_documents(documents)


# print(docs[0])




In [None]:
docs[0]

Document(page_content='Document\n\nUNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549\n\nFORM 10-Q\n\n(Mark One)\n\nQUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the quarterly period ended March 31, 2022\n\nCommission File Number: 001-38902\n\nUBER TECHNOLOGIES, INC.\n\n(Exact name of registrant as specified in its charter)\n\nDelaware 45-2647441\n\n(State or other jurisdiction of incorporation or organization) 1515 3rd Street (I.R.S. Employer Identification No.)\n\nSan Francisco, California 94158\n\n(Address of principal executive offices, including zip code) (415) 612-8582\n\n(Registrant’s telephone number, including area code)\n\nSecurities registered pursuant to Section 12(b) of the Act:\n\nTitle of each class Trading Symbol(s) Name of each exchange on which registered Common Stock, par value $0.00001 per share UBER New York Stock Exchange\n\nIndicate by check mark whether the registrant (1) has filed all reports re

In [None]:
from langchain_community.vectorstores import LanceDB

# LanceDB as vector store
vectorstore = LanceDB.from_documents(documents=docs, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the parsed markdown.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_lama = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
# query
query1 = " what is the net loss value attributable to Uber compared to last year?"
rag_chain_lama.invoke(query1)

'The net loss attributable to Uber Technologies, Inc. for the first quarter of 2022 was $5.9 billion, compared to a net loss of $108 million in the same period in 2021. This represents a significant increase in net loss compared to the previous year.'

In [None]:
# query
query2 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
rag_chain_lama.invoke(query2)

'The Cash paid for Income taxes, net of refunds is not specifically mentioned in the provided context.'

In [None]:
# query
query3 = "give me detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022"
rag_chain_lama.invoke(query3)

'Detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022, are as follows:\n\nAs of December 31, 2021:\n- Consumer, Merchant and other relationships: $1,494 million\n- Developed technology: $599 million\n- Trade names and trademarks: $167 million\n- Patents: $7 million\n- Other: $2 million\n\nAs of March 31, 2022:\n- Consumer, Merchant and other relationships: $1,574 million\n- Developed technology: $653 million\n- Trade names and trademarks: $175 million\n- Patents: $8 million\n- Other: $2 million'

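So far we are not getting answers to every question, and the query 3 answer above assigns the two sets of values to the opposite dates from the LlamaIndex answer in section 2. Now let's try LlamaParse with LlamaIndex and see the results.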

# 4. LlamaParse with LlamaIndex on PDF

In [None]:
# using LlamaParse with LlamaIndex

from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)
from llama_parse import LlamaParse

pdf_table_LlamaParse = LlamaParse(result_type="markdown").load_data(
    "/content/data_pdf/uber_10q_march_2022.pdf"
)

In [None]:
# LanceDB as retriever
vector_store_lamaparser = LanceDBVectorStore(uri="/tmp/lancedb_parser")
storage_context_lamaparser = StorageContext.from_defaults(
    vector_store=vector_store_lamaparser
)

In [None]:
lance_index_lamaparser = VectorStoreIndex.from_documents(
    pdf_table_LlamaParse, storage_context=storage_context_lamaparser
)

# reranker
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

# index
Lance_index_query_lamaparser = lance_index_lamaparser.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)


In [None]:
# query
query_parser1 = "what is Adjusted EBITDA 2021 vs 2022? what is interest expense"

response_1 = Lance_index_query_lamaparser.query(query_parser1)
print(response_1)

Started parsing the file under job_id 69ea99de-dd19-46db-86d5-e03d2027a7ce




Adjusted EBITDA improved by $527 million from a loss of $359 million in 2021 to $168 million in 2022. Interest expense increased by an immaterial amount.


In [None]:
# query
query_parser2 = "give me detailed charts of intangible assets, net as of December 31, 2021 and March 31, 2022"

response_2 = Lance_index_query_lamaparser.query(query_parser2)
print(response_2)

| |Gross Carrying Value|Accumulated Amortization|Net Carrying Value|Useful Life - Years|
|---|---|---|---|---|
|Consumer, Merchant and other relationships|$1,868|$(294)|$1,574|9|
|Developed technology|$922|$(269)|$653|5|
|Trade names and trademarks|$222|$(47)|$175|6|
|Patents|$15|$(7)|$8|7|
|Other|$5|$(3)|$2|0|

| |Gross Carrying Value|Accumulated Amortization|Net Carrying Value|Useful Life - Years|
|---|---|---|---|---|
|Consumer, Merchant and other relationships|$1,856|$(362)|$1,494|9|
|Developed technology|$924|$(325)|$599|5|
|Trade names and trademarks|$222|$(55)|$167|6|
|Patents|$15|$(8)|$7|6|
|Other|$5|$(3)|$2|0|
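
Because the answer comes back as markdown tables, it can be checked mechanically. The sketch below (helper names are hypothetical) parses the two tables printed above, verifies that gross carrying value plus accumulated amortization equals net carrying value on every row, and sums the net column so it can be compared with the totals quoted in section 2 ($2,412 million and $2,269 million):

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited markdown row into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Return one {header: cell} dict per data row, skipping the |---| separator row."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    headers = [h or "Item" for h in split_row(lines[0])]  # the first header cell is blank
    return [dict(zip(headers, split_row(line))) for line in lines[2:]]


def to_number(cell: str) -> int:
    """Convert '$1,574' to 1574 and the accounting-style negative '$(294)' to -294."""
    digits = cell.replace("$", "").replace(",", "").strip()
    if digits.startswith("(") and digits.endswith(")"):
        return -int(digits[1:-1])
    return int(digits)


def check_table(table: str) -> int:
    """Verify gross + accumulated amortization == net on every row; return the net total."""
    total = 0
    for row in parse_markdown_table(table):
        net = to_number(row["Net Carrying Value"])
        gross = to_number(row["Gross Carrying Value"])
        amortization = to_number(row["Accumulated Amortization"])
        if gross + amortization != net:
            raise ValueError(f"row does not add up: {row['Item']}")
        total += net
    return total


first_table = """
| |Gross Carrying Value|Accumulated Amortization|Net Carrying Value|Useful Life - Years|
|---|---|---|---|---|
|Consumer, Merchant and other relationships|$1,868|$(294)|$1,574|9|
|Developed technology|$922|$(269)|$653|5|
|Trade names and trademarks|$222|$(47)|$175|6|
|Patents|$15|$(7)|$8|7|
|Other|$5|$(3)|$2|0|
"""

second_table = """
| |Gross Carrying Value|Accumulated Amortization|Net Carrying Value|Useful Life - Years|
|---|---|---|---|---|
|Consumer, Merchant and other relationships|$1,856|$(362)|$1,494|9|
|Developed technology|$924|$(325)|$599|5|
|Trade names and trademarks|$222|$(55)|$167|6|
|Patents|$15|$(8)|$7|6|
|Other|$5|$(3)|$2|0|
"""

print(check_table(first_table), check_table(second_table))
```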


In [None]:
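# query (the cash-paid question from section 2, now against the LlamaParse index)
response_3 = Lance_index_query_lamaparser.query(qa_lama_query1)
print("***********New pdf+ lancedb ***********")
print(response_3)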


***********New pdf+ lancedb ***********
The cash paid for income taxes, net of refunds, was $22 million for the three months ended March 31, 2021, and $41 million for the three months ended March 31, 2022.


**WOW 🤩 the answers are clearer and better compared to the other methods. That's the power of LlamaParse: combined with LlamaIndex, it does quite well on both table and text data from the PDF.**
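
To compare the four methods side by side rather than cell by cell, the same questions can be run through every pipeline in one loop. This is a minimal sketch: `compare_pipelines` and the marker list are hypothetical, the markers are simply the non-answer phrasings that appeared in the outputs above, and the two lambda pipelines are stubs so the block runs on its own:

```python
from typing import Callable

# Phrases that showed up in the outputs above when a pipeline had no answer
NON_ANSWER_MARKERS = ("i don't know", "i don't have", "not specifically mentioned")


def compare_pipelines(
    pipelines: dict[str, Callable[[str], str]],
    questions: list[str],
) -> list[dict[str, object]]:
    """Ask every question of every pipeline and flag replies that look like non-answers."""
    rows = []
    for question in questions:
        for name, ask in pipelines.items():
            answer = str(ask(question))
            answered = not any(marker in answer.lower() for marker in NON_ANSWER_MARKERS)
            rows.append(
                {"pipeline": name, "question": question, "answer": answer, "answered": answered}
            )
    return rows


# Stub pipelines; in the notebook these could be, for example:
#   "langchain": rag_chain.invoke
#   "llamaindex": lambda q: str(lance_index_query_pdf.query(q))
demo_pipelines = {
    "stub_a": lambda q: "I don't know.",
    "stub_b": lambda q: "The net loss was $5.9 billion.",
}
for row in compare_pipelines(demo_pipelines, ["what is the net loss?"]):
    print(row["pipeline"], row["answered"])
```

The `answered` flag only catches explicit refusals; a confident but wrong answer still needs to be checked against the PDF.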
