https://github.com/run-llama/llama_parse

https://cloud.llamaindex.ai/parse

#### Stages of RAG and our approach
https://docs.llamaindex.ai/en/stable/getting_started/concepts.html

* Loading - getting data from where it currently lives into our pipeline
    * Connectors - aka Readers; ingest data from different sources and formats into Documents & Nodes - **we're using LlamaParse**
    * Documents & Nodes - a Document is a generic container around any data source; a Node is a chunk of a source `Document`. **not done separately for us**
* Indexing - putting the data into a structure that is easy to retrieve from; `VectorStoreIndex.from_documents` handles node splitting, embedding and indexing for us
    * Indexes - data stored in a format that is easy to retrieve, usually vector embeddings; can be kept in a vector store
    * Embeddings - numerical representations of data, created with an embedding model; the query embedding is matched against stored embeddings by similarity
* Querying - ask questions and get responses - **we use a query engine**
    * Retrievers - retrieve relevant context
    * Router - decides which retriever is used
    * Node Postprocessor - applies transformations, filters and re-ranking logic to retrieved nodes
    * Response Synthesiser - generates the response from the LLM using the query and the retrieved chunks
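
To make the Embeddings and Retrievers bullets concrete, here is a minimal sketch of similarity matching. It is not how LlamaIndex implements retrieval: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `embed`, `cosine` and `retrieve` are hypothetical names used only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the top_k most similar."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# made-up chunks, only for illustration
chunks = [
    "bart is a denoising autoencoder for pretraining sequence-to-sequence models",
    "the bank reported quarterly results in a press release",
    "rouge scores measure summarization quality",
]
top = retrieve("how is bart pretrained", chunks, top_k=1)
print(top[0][1])
```

The `top_k` argument plays the same role as `similarity_top_k` on a retriever: it caps how many chunks are handed to the response synthesiser.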

In [1]:
import os

import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

In [2]:
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('Check for API key in environment variable')

API key found


In [3]:
# instantiate parser
parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    result_type="markdown",  # or "text"
    # num_workers=4,  # for parsing multiple files
    verbose=True,
    language="en",  # default is English
)

In [4]:
# load document and parse it 
documents = parser.load_data('../data/1910.13461.pdf')

Started parsing the file under job_id 449fa1cc-a281-47dd-836b-ade1d6736062


In [5]:
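# alternative to calling parser.load_data directly: let SimpleDirectoryReader
# route .pdf files through LlamaParse (this parses the same file a second time)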
file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader(input_files=['../data/1910.13461.pdf'], file_extractor=file_extractor)
documents = reader.load_data()

Started parsing the file under job_id 97a15eaa-0826-4e06-85b2-07d99a42b453


In [6]:
# split into nodes and create an index from parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create query engine
query_engine = index.as_query_engine()
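
The "split into nodes" step can be pictured with a toy chunker. This is only a sketch under simplifying assumptions: it splits on whitespace and counts words, whereas a real node parser works on sentences and tokens. `chunk_words` and its default sizes are hypothetical and are not LlamaIndex defaults.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# made-up text of 20 words: w0 ... w19
text = " ".join(f"w{i}" for i in range(20))
nodes = chunk_words(text, chunk_size=8, overlap=2)
for n in nodes:
    print(n)
```

The overlap means neighbouring chunks share a little context, so a sentence cut at a chunk boundary is still partly visible in both chunks.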

In [7]:
query = "Tell me about R1 of bart for different datasets"

resp = query_engine.query(query)
print(resp)

R1 of BART for different datasets are as follows:
- ELI5 dataset: 30.6
- CNN/DailyMail dataset: 44.16
- XSum dataset: 45.14


In [8]:
response = query_engine.query("list all the tasks that work with bart")
print(response)

Summarization, Dialogue response generation, Abstractive QA, Translation


In [62]:
# load document and parse it 
documents = parser.load_data('../data/axis-press-release-q3fy24.pdf')

Started parsing the file under job_id 74d94431-e2f3-4a2e-9c12-b830f0aee605


In [73]:
# split into nodes and create an index from parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create query engine
query_engine = index.as_query_engine()

In [77]:
response = query_engine.query("summarize comments made by amitabh chaudhry?")
print(response)

Amitabh Chaudhry, MD&CEO of Axis Bank, mentioned that India is being looked upon as an important investment destination and that the Indian economic momentum has been strong. He emphasized that Axis Bank's focus is on sustainable and inclusive growth, with the customer being at the center of every discussion. He also mentioned the celebration of 'Sparsh Week', which involved educative customer-centric activities across multiple branches and retail asset centers, reaching out to a large number of employees.


https://ai.gopubby.com/llamaparse-rag-beats-all-comers-60948c6cc0e4

#### Using MarkdownElementNodeParser

#### Stages of RAG and our approach
https://docs.llamaindex.ai/en/stable/getting_started/concepts.html

* Loading - getting data from where it currently lives into our pipeline
    * Connectors - aka Readers; ingest data from different sources and formats into Documents & Nodes - **we're using LlamaParse**
    * Documents & Nodes - a Document is a generic container around any data source; a Node is a chunk of a source `Document`. **using `MarkdownElementNodeParser`**
* Indexing - putting the data into a structure that is easy to retrieve from; `VectorStoreIndex` handles embedding and indexing for us
    * Indexes - data stored in a format that is easy to retrieve, usually vector embeddings; can be kept in a vector store
    * Embeddings - numerical representations of data, created with an embedding model; the query embedding is matched against stored embeddings by similarity
* Querying - ask questions and get responses - **we use a query engine**
    * Retrievers - retrieve relevant context
    * Router - decides which retriever is used
    * Node Postprocessor - applies transformations, filters and re-ranking logic to retrieved nodes
    * Response Synthesiser - generates the response from the LLM using the query and the retrieved chunks
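
The querying bullets can be sketched as two small steps: post-process the retrieved chunks, then assemble the prompt that goes to the LLM. This is an illustration only, not LlamaIndex's implementation. `ScoredChunk`, `postprocess`, `build_prompt`, the cutoff value and the prompt wording are all hypothetical, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float  # similarity score assigned by the retriever


def postprocess(chunks: list[ScoredChunk], cutoff: float = 0.5, top_n: int = 2) -> list[ScoredChunk]:
    """Drop chunks below the cutoff, then keep the top_n by score."""
    kept = [c for c in chunks if c.score >= cutoff]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:top_n]


def build_prompt(query: str, chunks: list[ScoredChunk]) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n---\n".join(c.text for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."


retrieved = [
    ScoredChunk("table of ROUGE scores", 0.82),
    ScoredChunk("author affiliations", 0.31),
    ScoredChunk("summarization results", 0.67),
]
kept = postprocess(retrieved)
prompt = build_prompt("What are the R1 scores?", kept)
print(prompt)
```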

In [12]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings, VectorStoreIndex
# splits markdown into text nodes and index nodes
from llama_index.core.node_parser import MarkdownElementNodeParser

In [70]:
# global settings parameters
# https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings.html
Settings.llm = OpenAI(temperature=0.2, model='gpt-3.5-turbo-0613')

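# embedding model; text-embedding-ada-002 has a fixed output size,
# so do not pass `dimensions` with it (the API rejects the request)
Settings.embed_model = OpenAIEmbedding(
    model='text-embedding-ada-002',
    embed_batch_size=100,
)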

In [71]:
# documents loaded using llamaparse
# load document and parse it 
# documents = parser.load_data('../data/1910.13461.pdf')

# use MarkdownElementNodeParser
# https://docs.llamaindex.ai/en/stable/api/llama_index.core.node_parser.MarkdownElementNodeParser.html#markdownelementnodeparser
node_parser = MarkdownElementNodeParser()
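
As a rough picture of what an element-level parser does with the LlamaParse markdown, the sketch below separates table blocks from ordinary text. It is a deliberately simplified stand-in, not the `MarkdownElementNodeParser` algorithm (which also summarises tables with the LLM). `split_markdown_elements` is a hypothetical helper and the sample markdown is made up.

```python
def split_markdown_elements(markdown: str) -> list[tuple[str, str]]:
    """Return (kind, content) pairs where kind is 'table' or 'text'."""
    elements = []
    buffer = []
    kind = None
    for line in markdown.splitlines():
        if not line.strip():
            line_kind = None  # a blank line closes the current element
        elif line.lstrip().startswith("|"):
            line_kind = "table"  # pipe-delimited row
        else:
            line_kind = "text"
        # flush the buffered element whenever the kind changes
        if line_kind != kind and buffer:
            elements.append((kind, "\n".join(buffer)))
            buffer = []
        kind = line_kind
        if line_kind is not None:
            buffer.append(line)
    if buffer:
        elements.append((kind, "\n".join(buffer)))
    return elements


sample = "# Results\nSome intro text.\n\n| name | value |\n|---|---|\n| a | 1 |\n\nClosing remark."
elements = split_markdown_elements(sample)
for kind, content in elements:
    print(kind, repr(content))
```

Keeping tables as their own elements is what lets them be summarised and indexed separately from the surrounding prose.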

In [72]:
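# extracts hierarchical graph of nodes & tables; generates summary for tables using llm
# get_nodes_from_documents takes the list of documents; get_nodes_from_node expects a
# single node, so passing the list raised
# AttributeError: 'list' object has no attribute 'get_content'
raw_nodes = node_parser.get_nodes_from_documents(documents)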

In [67]:
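# split raw nodes into base nodes plus a mapping from index id to table node
# (assumes the installed llama_index version provides get_base_nodes_and_mappings;
# newer releases expose get_nodes_and_objects instead)
base_nodes, node_mapping = node_parser.get_base_nodes_and_mappings(raw_nodes)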

In [68]:
from llama_index.core.schema import IndexNode

# inspect one IndexNode (table reference); index 20 assumes at least 21 of them exist
example_20 = [b for b in base_nodes if isinstance(b, IndexNode)][20]

In [69]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(base_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=1)

Retrying llama_index.embeddings.openai.base.get_embeddings in 0.7438570728827489 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'This model does not support specifying dimensions.', 'type': 'invalid_request_error', 'param': None, 'code': None}}.
Retrying llama_index.embeddings.openai.base.get_embeddings in 1.8623587079446124 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'This model does not support specifying dimensions.', 'type': 'invalid_request_error', 'param': None, 'code': None}}.
Retrying llama_index.embeddings.openai.base.get_embeddings in 0.8602852819719353 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'This model does not support specifying dimensions.', 'type': 'invalid_request_error', 'param': None, 'code': None}}.
Retrying llama_index.embeddings.openai.base.get_embeddings in 3.108959986477104 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'This model

KeyboardInterrupt: 
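
The retries above come from a `dimensions` value being sent to a model that does not accept it. One way to avoid this is to build the embedding arguments through a small guard, sketched below. `embedding_kwargs` and `MODELS_WITH_DIMENSIONS` are hypothetical names, and the model list is an assumption (the OpenAI `text-embedding-3` models accept `dimensions`, while `text-embedding-ada-002` does not).

```python
from typing import Optional

# Assumption: only these OpenAI embedding models accept a `dimensions` argument.
MODELS_WITH_DIMENSIONS = ("text-embedding-3-small", "text-embedding-3-large")


def embedding_kwargs(model: str, dimensions: Optional[int] = None, embed_batch_size: int = 100) -> dict:
    """Build keyword arguments for the embedding model, failing fast on bad combinations."""
    kwargs = {"model": model, "embed_batch_size": embed_batch_size}
    if dimensions is not None:
        if model not in MODELS_WITH_DIMENSIONS:
            raise ValueError(f"{model} does not support specifying dimensions")
        kwargs["dimensions"] = dimensions
    return kwargs


print(embedding_kwargs("text-embedding-ada-002"))
print(embedding_kwargs("text-embedding-3-small", dimensions=512))
```

The result could then be passed on as `OpenAIEmbedding(**embedding_kwargs(...))`, assuming the installed `OpenAIEmbedding` accepts a `dimensions` keyword. Failing locally with a `ValueError` is quicker to diagnose than a loop of 400 responses.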