https://github.com/run-llama/llama_parse

https://cloud.llamaindex.ai/parse

#### Stages of RAG and our approach
https://docs.llamaindex.ai/en/stable/getting_started/concepts.html

* Loading - getting data from where it is currently into our pipeline
    * Connectors - aka Readers; ingest data from different sources and formats into Documents & Nodes - **we're using LlamaParse**
    * Documents & Nodes - a Document is a generic container around any data source; a Node is a chunk of a source `Document`. **not done separately for us**
* Indexing - structuring the data for easy retrieval; `VectorStoreIndex.from_documents` handles node splitting, embedding and indexing in one call
    * Indexes - data stored in a format that is easy to retrieve, usually vector embeddings; can be persisted in a vector store
    * Embeddings - numerical representations of data, created with an embedding model; a query's embedding is matched against stored embeddings by similarity
* Querying - ask questions and get responses - **we use a query engine**
    * Retrievers - Retrieve relevant context
    * Router - Which retriever is used
    * Node Postprocessor - Apply transformations, filters and re-ranking logic to retrieved nodes
    * Response Synthesiser - Generates response from LLM using query and retrieved chunks
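
To make the embedding, retrieval and postprocessing stages above concrete, here is a minimal toy sketch in plain Python. It is not how LlamaIndex works internally: the "embedding" is an assumed bag-of-words count vector, and the names `embed`, `cosine`, `retrieve` and the sample `chunks` are all hypothetical. A real pipeline would use an embedding model and a vector store.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2, min_score=0.0):
    """Retriever plus a simple postprocessor: score, drop weak matches, keep top_k."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s > min_score]  # similarity cutoff
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical chunks standing in for Nodes
chunks = [
    "BART is a denoising autoencoder for pretraining sequence-to-sequence models.",
    "The bank reported quarterly operating profit and net interest income.",
    "BART is evaluated on summarization datasets using ROUGE scores.",
]

results = retrieve("Which datasets is BART evaluated on?", chunks)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

The retrieved chunks would then be passed, together with the query, to the response synthesiser (an LLM call), which is left out here.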

In [1]:
import os

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [2]:
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('Check for API key in environment variable')

API key found
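
The check above only prints a message, so a missing key surfaces later as a failed parse. A possible alternative, sketched here with a hypothetical helper `require_env` (not part of LlamaParse or LlamaIndex), is to fail immediately with a descriptive error:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage in this notebook would look like:
# LLAMAPARSE_API_KEY = require_env("LLAMAPARSE_API_KEY")
```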


In [3]:
# instantiate parser
parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    result_type="markdown", # or text
    # num_workers=4 # for multiple files
    verbose=True,
    language="en", # default is english
)

In [4]:
# load document and parse it 
documents = parser.load_data('../data/1910.13461.pdf')

Started parsing the file under job_id 449fa1cc-a281-47dd-836b-ade1d6736062


In [5]:
# alternative to the cell above: load the same PDF through SimpleDirectoryReader, with LlamaParse as the .pdf extractor
file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader(input_files=['../data/1910.13461.pdf'], file_extractor=file_extractor)
documents = reader.load_data()

Started parsing the file under job_id 97a15eaa-0826-4e06-85b2-07d99a42b453


In [6]:
# split into nodes and create an index from parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create query engine
query_engine = index.as_query_engine()

In [7]:
query = "Tell me about R1 of bart for different datasets"

resp = query_engine.query(query)
print(resp)

R1 of BART for different datasets are as follows:
- ELI5 dataset: 30.6
- CNN/DailyMail dataset: 44.16
- XSum dataset: 45.14


In [8]:
response = query_engine.query("list all the tasks that work with bart")
print(response)

Summarization, Dialogue response generation, Abstractive QA, Translation


In [62]:
# load document and parse it 
documents = parser.load_data('../data/axis-press-release-q3fy24.pdf')

Started parsing the file under job_id 74d94431-e2f3-4a2e-9c12-b830f0aee605


In [73]:
# split into nodes and create an index from parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create query engine
query_engine = index.as_query_engine()

In [77]:
response = query_engine.query("summarize comments made by amitabh chaudhry?")
print(response)

Amitabh Chaudhry, MD&CEO of Axis Bank, mentioned that India is being looked upon as an important investment destination and that the Indian economic momentum has been strong. He emphasized that Axis Bank's focus is on sustainable and inclusive growth, with the customer being at the center of every discussion. He also mentioned the celebration of 'Sparsh Week', which involved educative customer-centric activities across multiple branches and retail asset centers, reaching out to a large number of employees.


https://ai.gopubby.com/llamaparse-rag-beats-all-comers-60948c6cc0e4

#### Using MarkdownElementNodeParser

In [29]:
import os

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.core.extractors import TitleExtractor


In [4]:
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('Check for API key in environment variable')

API key found


In [30]:
from llama_index.llms.openai import OpenAI  # needed here when cells run top to bottom
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=1024)


In [5]:
# instantiate parser
parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    result_type="markdown", # or text
    # num_workers=4 # for multiple files
    verbose=True,
    language="en", # default is english
)

In [6]:
# load document and parse it 
# documents = parser.load_data('../data/axis-press-release-q3fy24.pdf')

In [46]:
file_extractor = {".pdf": parser}
filename_fn = lambda filename: {"file_name": filename}
# filename_fn = 'axis bank earnings press release for Quarter ended Dec 2023'
reader = SimpleDirectoryReader(
    input_files=['../data/axis-press-release-q3fy24.pdf'], 
    file_extractor=file_extractor,
    filename_as_id=True, # to refresh documents in the index
    file_metadata = filename_fn,
    )
documents = reader.load_data()

Started parsing the file under job_id dcbe4c35-82d4-4fca-94e0-1b80d17fea3d


In [48]:
type(documents[0])

llama_index.core.schema.Document

In [42]:
# document = Document(
#     documents,
#     metadata={"filename": "axis-press-release-q3fy24",
#               "category":"press release",
#               "quarter":"q3",
#               "financial_year":"fy24",
#               },
# )
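
The commented-out cell above aims at richer per-document metadata (category, quarter, financial year). One way to get there without building `Document` objects by hand might be a fuller `file_metadata` callback, since `SimpleDirectoryReader` calls it with the file path and expects a dict. The sketch below assumes file names end in a `q<quarter>fy<year>` suffix, as in `axis-press-release-q3fy24.pdf`; the function name `press_release_metadata` and the hard-coded category are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming convention: "<name>-q<quarter>fy<two-digit year>.pdf"
_PERIOD = re.compile(r"q([1-4])fy(\d{2})$", re.IGNORECASE)


def press_release_metadata(file_path):
    """Hypothetical file_metadata callback: derive metadata from the file name."""
    stem = Path(file_path).stem
    metadata = {"file_name": stem, "category": "press release"}  # category is assumed
    match = _PERIOD.search(stem)
    if match:
        metadata["quarter"] = f"q{match.group(1)}"
        metadata["financial_year"] = f"fy{match.group(2)}"
    return metadata


print(press_release_metadata("../data/axis-press-release-q3fy24.pdf"))
```

It could be passed as `file_metadata=press_release_metadata` in place of the `filename_fn` lambda used earlier.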

In [8]:
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser, TokenTextSplitter

In [18]:
node_parser = MarkdownElementNodeParser(llm=llm, include_metadata=True)
nodes = node_parser.get_nodes_from_documents(documents)

Embeddings have been explicitly disabled. Using MockEmbedding.


7it [00:00, 5359.64it/s]
100%|██████████| 7/7 [00:07<00:00,  1.13s/it]


In [19]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [56]:
node_parser = MarkdownElementNodeParser(llm=llm, include_metadata=True)
title_extractor = TitleExtractor(nodes=5)

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    # transformations run in order: parse markdown elements, then extract titles
    transformations=[node_parser, title_extractor],
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

Embeddings have been explicitly disabled. Using MockEmbedding.


7it [00:00, 1557.24it/s]
100%|██████████| 7/7 [00:05<00:00,  1.18it/s]
Parsing nodes: 100%|██████████| 1/1 [00:06<00:00,  6.14s/it]
100%|██████████| 1/1 [00:01<00:00,  1.06s/it]
100%|██████████| 5/5 [00:01<00:00,  3.02it/s]
100%|██████████| 2/2 [00:01<00:00,  1.63it/s]
100%|██████████| 4/4 [00:01<00:00,  3.49it/s]
100%|██████████| 1/1 [00:00<00:00,  1.41it/s]
100%|██████████| 1/1 [00:00<00:00,  1.55it/s]
100%|██████████| 1/1 [00:00<00:00,  1.15it/s]
100%|██████████| 1/1 [00:00<00:00,  1.48it/s]
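
The ingestion pipeline idea is that each entry in `transformations` is a component that takes a list of nodes and returns a list of nodes, and the entries are applied in order. A toy sketch of that flow in plain Python follows; it uses dicts instead of LlamaIndex node objects, and `split_paragraphs`, `add_title` and `run_pipeline` are hypothetical stand-ins for the node parser, the title extractor and `pipeline.run`.

```python
def split_paragraphs(nodes):
    """Stand-in for a node parser: split each node's text on blank lines."""
    out = []
    for node in nodes:
        for part in node["text"].split("\n\n"):
            if part.strip():
                out.append({"text": part.strip(), "metadata": dict(node["metadata"])})
    return out


def add_title(nodes):
    """Stand-in for title extraction: use the first line of the first node as the title."""
    title = nodes[0]["text"].splitlines()[0] if nodes else ""
    for node in nodes:
        node["metadata"]["document_title"] = title
    return nodes


def run_pipeline(documents, transformations):
    """Apply each transformation in order, feeding its output to the next one."""
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [{"text": "Title line\nintro text\n\nSecond paragraph", "metadata": {"file_name": "x.pdf"}}]
toy_nodes = run_pipeline(docs, [split_paragraphs, add_title])
for node in toy_nodes:
    print(node["metadata"]["document_title"], "|", node["text"][:20])
```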


In [59]:
nodes[0]

TextNode(id_='68b864cd-28ef-472a-84af-a381d11b85df', embedding=None, metadata={'file_name': '../data/axis-press-release-q3fy24.pdf', 'file_path': '../data/axis-press-release-q3fy24.pdf', 'document_title': 'AXIS BANK Q3 FY23 FINANCIAL RESULTS ANNOUNCEMENT AND ANALYSIS'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a35c16de-d41a-4772-b261-68011f55cce2', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d7793806f4c6e8565256d5287a2ce6af74cc609ed03e51951736ee3e9b859ff3'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='id_../data/axis-press-release-q3fy24.pdf_part_0_4_table_ref', node_type=<ObjectType.INDEX: '3'>, metadata={'col_schema': 'Column: Q3FY24\nType: Operating profit, PAT, Consolidated ROE\nSummary: None\n\nColumn: 9MFY24\nType: PAT, Consolidated ROE\nSummary: None', 'file_name': '../data/axis-press-release-q3fy24.pdf', 'file_path': '../data/axis-press-release-q3fy24.pdf'}, ha

In [53]:
base_nodes[1].metadata

{'file_name': '../data/axis-press-release-q3fy24.pdf',
 'file_path': '../data/axis-press-release-q3fy24.pdf'}

In [25]:
objects[1].metadata

{'col_schema': '',
 'file_name': '../data/axis-press-release-q3fy24.pdf',
 'file_path': '../data/axis-press-release-q3fy24.pdf'}

In [None]:
# build an index over the text nodes plus the table objects from the markdown element parser
index = VectorStoreIndex(nodes=base_nodes+objects)

# create query engine
query_engine = index.as_query_engine()

In [None]:
query = "what is axis bank's RoA"

resp = query_engine.query(query)
print(resp)

1.84%
