# RAG using LlamaIndex


In [None]:
### Setup environment 

!python3 -m venv rag-pipeline
!source rag-pipeline/bin/activate

##### Install dependencies

!pip install -r requirements.txt --quiet

In [20]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Loader

In [21]:
from pathlib import Path

from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
docs0 = loader.load_data(file_path=Path("data/State of AI Report 2023.pdf"), metadata=True)

In [22]:
print(f" docs is a {type(docs0)}, of length {len(docs0)}, where each element is a {type(docs0[0])} object")

 docs is a <class 'list'>, of length 163, where each element is a <class 'llama_index.core.schema.Document'> object


In [34]:
for k, v in docs0[94]:
    print(f"{k}: {v}")

id_: 2904107e-66a6-4a84-9329-d672455b7d14
embedding: None
metadata: {'total_pages': 163, 'file_path': 'data/State of AI Report 2023.pdf', 'source': '95'}
excluded_embed_metadata_keys: []
excluded_llm_metadata_keys: []
relationships: {}
text:     In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring 
DALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content 
licensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and 
associated metadata for model training. Furthermore, Shutterstock will offer its customers indemniﬁcation for AI 
image creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in 
stark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its 
ongoing lawsuit against Stability AI for copyright infringement ﬁled in Feb 2023. 
stateof.

## Clean text and add metadata

In [35]:
import re

def clean_slide_text(text: str) -> str:
    """
    Cleans the provided slide text by removing specific patterns and extra whitespace.

    Parameters:
        text (str): The raw text extracted from a single slide.

    Returns:
        str: The cleaned text.
    """
    # Remove the footer text
    text = text.replace("stateof.ai 2023", "")

    # Remove the header text
    text = text.replace("Introduction | Research | Industry | Politics | Safety | Predictions", "")

    # Remove the pattern "#stateofai | n"
    text = re.sub(r"#stateofai(\s*\|\s*\d+)?", "", text)

    # Replace multiple consecutive spaces with a single space
    text = re.sub(r" +", " ", text)

    # Remove any leading or trailing whitespace
    text = text.strip()

    return text
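
As a quick sanity check, the same cleaning steps can be run on a made-up slide string. The sample text below is hypothetical and not taken from the report; only the replacement string and regular expressions come from the function above.

```python
import re

# Hypothetical slide text, invented for illustration only.
sample = "Compute   is the new oil\n#stateofai | 42\nstateof.ai 2023"

# Same steps as clean_slide_text, applied inline.
text = sample.replace("stateof.ai 2023", "")           # footer
text = re.sub(r"#stateofai(\s*\|\s*\d+)?", "", text)   # "#stateofai | n"
text = re.sub(r" +", " ", text)                        # collapse repeated spaces
cleaned = text.strip()                                 # trim leading/trailing whitespace

print(repr(cleaned))
```

Note that only runs of spaces are collapsed; newlines inside the slide text are left in place, which matters later because the bullet splitting relies on them.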

In [12]:
def assign_section(document):
    """
    Assigns a section to the document based on its page number.

    The function updates the 'metadata' attribute of the document with a key 'section'
    that has a value corresponding to the section the page number falls into.

    Sections:
    - Pages 1 through 10: Introduction
    - Pages 11 through 68: Research
    - Pages 69 through 120: Politics
    - Pages 121 through 137: Safety
    - Pages 138 and beyond: Predictions

    Args:
    - document (Document): The Document object to be updated.

    Returns:
    None. The function updates the Document object in-place.
    """

    page_number = int(document.metadata['source'])

    if 1 <= page_number <= 10:
        document.metadata['section'] = 'Introduction'
    elif 11 <= page_number <= 68:
        document.metadata['section'] = 'Research'
    elif 69 <= page_number <= 120:
        document.metadata['section'] = 'Politics'
    elif 121 <= page_number <= 137:
        document.metadata['section'] = 'Safety'
    else:
        document.metadata['section'] = 'Predictions'

In [36]:
# Iterate through each Document object in docs0
for doc in docs0:
    # Update the metadata using assign_section
    assign_section(doc)

    # Metadata keys that are excluded from text for the embed model.
    doc.excluded_embed_metadata_keys=['file_name']

    # Apply clean_slide_text to the text attribute
    doc.text = clean_slide_text(doc.text)

print(docs0[94].text)

In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring 
DALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content 
licensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and 
associated metadata for model training. Furthermore, Shutterstock will offer its customers indemniﬁcation for AI 
image creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in 
stark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its 
ongoing lawsuit against Stability AI for copyright infringement ﬁled in Feb 2023. 


 
2022 Prediction: A major user generated content site negotiates a commercial 
settlement with a start-up producing AI models (e.g. OpenAI) for training on their corpus
vs.


In [37]:
docs0[94].metadata

docs0[94].get_content()

{'total_pages': 163,
 'file_path': 'data/State of AI Report 2023.pdf',
 'source': '95',
 'section': 'Politics'}

"In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring \nDALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content \nlicensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and \nassociated metadata for model training. Furthermore, Shutterstock will offer its customers indemniﬁcation for AI \nimage creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in \nstark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its \nongoing lawsuit against Stability AI for copyright infringement ﬁled in Feb 2023. \n\n\n \n2022 Prediction: A major user generated content site negotiates a commercial \nsettlement with a start-up producing AI models (e.g. OpenAI) for training on their corpus\nvs."

## Chunking Strategies: Options

Two options here:
1. Send the entire Document object directly to the index
    - Maintains the entire document as a single unit
    - Useful when documents are relatively short and the context between different parts of the document is important
2. Convert the Document into Node objects before sending them to the index
    - Practical when the documents are long and need to be broken into chunks (or nodes) before indexing
    - Useful for retrieving specific parts of a document rather than the entire document

## Convert Document object to Node: Node and NodeParser

- A Node represents a chunk of a source document
- Nodes contain metadata and relationship information with other nodes
- Nodes are first-class citizens in LlamaIndex: Nodes and their attributes can be defined directly
- Every node derived from a Document inherits the metadata of that Document
- Alternatively, we can parse source Documents into Nodes using the NodeParser classes.


**Chunk Size:**

The choice of chunk_size directly affects retrieval quality and speed:
- A smaller chunk_size produces more granular chunks, but risks that essential information is not among the top retrieved chunks
- A larger chunk_size makes it more likely that all necessary information is contained within the top chunks
- Increasing chunk_size sends more information to the LLM. This gives a more comprehensive context but might slow down the system.
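
To make the trade-off concrete, here is a minimal sketch that splits the same text at several chunk sizes and reports how many chunks result. It counts whitespace-separated words rather than tokens, and `chunk_words` is a hypothetical helper written for this illustration, not a LlamaIndex function.

```python
from typing import List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Toy text of 100 made-up words.
text = " ".join(f"w{i}" for i in range(100))

for size in (10, 25, 50):
    chunks = chunk_words(text, size)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Fewer, larger chunks mean each retrieved result carries more context into the LLM, while many small chunks make each match more specific.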


In [38]:
import re

# Define the pattern for bullet points and newlines
split_pattern = r"\n●|\n-|\n"

# Initialize lists to store the word counts of all chunks and entire texts across all documents
chunk_word_counts = []
entire_text_word_counts = []

# Initialize a dictionary to store word counts and slide counts by section
section_data = {}

# Iterate through each Document object in your list of documents
for doc in docs0:
    # Split the document's text into chunks based on the pattern
    chunks = re.split(split_pattern, doc.text)

    # Calculate the number of words in each chunk and store it
    chunk_word_counts.extend([len(chunk.split()) for chunk in chunks])

    # Calculate the number of words in the entire text and store it
    entire_word_count = len(doc.text.split())
    entire_text_word_counts.append(entire_word_count)

    # Update the word count and slide count for the section in the dictionary
    section = doc.metadata['section']
    if section in section_data:
        section_data[section]['word_count'] += entire_word_count
        section_data[section]['slide_count'] += 1
    else:
        section_data[section] = {'word_count': entire_word_count, 'slide_count': 1}

# Calculate the total word count across all sections
total_word_count = sum(data['word_count'] for data in section_data.values())

# Calculate the number of sections
num_sections = len(section_data)

# Calculate the average word count across all sections
average_word_count_across_sections = total_word_count / num_sections

# Calculate summary statistics for chunks
average_chunk_word_count = sum(chunk_word_counts) / len(chunk_word_counts)
max_chunk_word_count = max(chunk_word_counts)

# Calculate average word count for entire texts
average_entire_text_word_count = sum(entire_text_word_counts) / len(entire_text_word_counts)

print(f"Average word count for a slide: {average_entire_text_word_count}")
print(f"Average word count per bullet point: {average_chunk_word_count}")
print(f"Longest bullet point: {max_chunk_word_count}")
print(f"Average word count in a section: {average_word_count_across_sections:.2f}")

Average word count for a slide: 128.12269938650306
Average word count per bullet point: 9.252131000448632
Longest bullet point: 28
Average word count in a section: 4176.80


### Chunking Strategy

- A *NodeParser* is a simple abstraction that takes a list of documents and chunks them into Node objects.
Each *Node* is a specific chunk of the parent document.
- Strategy: Utilize smaller child chunks that refer to bigger parent chunks.
    - Use *SimpleNodeParser* with a *SentenceSplitter* to create "base nodes" (the parent chunks)
    - Use *SentenceWindowNodeParser* to create child nodes that represent bullet points in the slide deck along with metadata

In [39]:
from llama_index.core.text_splitter import SentenceSplitter
from llama_index.core.node_parser import SimpleNodeParser
from pathlib import Path

# bullet_splitter = SentenceSplitter(paragraph_separator=r"\n●|\n-|\n", chunk_size=250)


parser = SentenceSplitter.from_defaults(
    chunk_size=250,
    paragraph_separator=r"\n●|\n-|\n",
    include_metadata=True,
    include_prev_next_rel=True,
)

slides_nodes = parser.get_nodes_from_documents(docs0)

In [49]:
slides_nodes[94]
len(slides_nodes[94].text)
slides_nodes[94].text


TextNode(id_='d7f2cfb4-e053-4e58-b7d7-6358336c7035', embedding=None, metadata={'total_pages': 163, 'file_path': 'data/State of AI Report 2023.pdf', 'source': '29', 'section': 'Research'}, excluded_embed_metadata_keys=['file_name'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='77577cc4-2cce-48be-99ed-7bbde7d42608', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'total_pages': 163, 'file_path': 'data/State of AI Report 2023.pdf', 'source': '29', 'section': 'Research'}, hash='3b856327adfb9de7690d0e9932df2c2cba0c5eb4c90e04e9420ff8e9b93116c5'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='4cfc1444-1007-483b-b020-b457cd07fc6b', node_type=<ObjectType.TEXT: '1'>, metadata={'total_pages': 163, 'file_path': 'data/State of AI Report 2023.pdf', 'source': '29', 'section': 'Research'}, hash='bdb611ea7014bf3d76388dfd64916a160c3e5d703ec6633ee60f9ffa751df940'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='b5047b2a-949b-4

1121

'We’re nowhere near a deﬁnitive answer: Synthetic data is becoming more helpful, but \nthere is still evidence showing that in some cases generated data makes models forget.\n\n● Despite the seemingly inﬁnitely proprietary and publicly available data, the largest models are actually running \nout of data to train on, and testing the limits of scaling laws. One way to alleviate this problem (which has been \nextensively explored in the past) is to train on AI-generated data, whose volume is only bounded by compute.\n\nBreaking the data ceiling: AI-generated content\n \n● Researchers from Google ﬁne-tune the Imagen text-to-image model for \nclass-conditional ImageNet, then generated one to 12 synthetic versions \nof ImageNet on which they trained their models (in addition to the \noriginal ImageNet). They showed that increasing the size of the synthetic \ndataset monotonically improved the model’s accuracy.\n● Other researchers showed that the compounding errors from training on \nsynthe

## SentenceWindowNodeParser

The SentenceWindowNodeParser splits a document into sentences, creates one Node per sentence, and captures a window of surrounding sentences for each node. The surrounding sentences supply context that a single sentence lacks on its own.

Each node stores this window in its metadata. The window size is set by the window_size attribute, which defaults to 3: for each sentence, the parser includes up to 3 sentences before and 3 sentences after it in the metadata.
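
The windowing idea can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not the parser's actual implementation: `sentence_windows` is a hypothetical helper, it splits on newlines only, and the key names `window` and `original_text` are chosen to resemble the parser's metadata keys and should be treated as an assumption.

```python
import re
from typing import Dict, List


def sentence_windows(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Pair each sentence with up to window_size neighbours on each side."""
    # Split on newlines and drop empty pieces.
    sentences = [s.strip() for s in re.split(r"\n", text) if s.strip()]
    result = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        result.append({
            "original_text": sentence,                  # the sentence itself
            "window": " ".join(sentences[start:end]),   # sentence plus neighbours
        })
    return result


windows = sentence_windows("A\nB\nC\nD\nE", window_size=1)
for w in windows:
    print(w)
```

Sentences at the start or end of the text simply get a shorter window, since there are fewer neighbours on one side.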

In [51]:
from llama_index.core.node_parser import SentenceWindowNodeParser
from typing import List
import re

def custom_sentence_splitter(text: str) -> List[str]:
    return re.split(r'\n●|\n-|\n', text)

bullet_node_parser = SentenceWindowNodeParser.from_defaults(
    sentence_splitter=custom_sentence_splitter,
    window_size=3,
    include_prev_next_rel=True,
    include_metadata=True
    )

## IndexNode in LlamaIndex

An IndexNode is a node type in LlamaIndex. An index, in turn, is a data structure that allows quick retrieval of relevant context for a user query, which is fundamental for retrieval-augmented generation (RAG) use cases. IndexNode inherits from TextNode, so it primarily represents textual content.

Every IndexNode also has an index_id attribute. The index_id is a reference to another object, which lets the node point to other entities in the system (here, the parent chunk) on top of carrying its own text. Linking chunks this way gives the synthesis step more context.

In short, an IndexNode represents both content and a relationship to another object.
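
The pointer relationship can be sketched with plain dataclasses. `ToyTextNode` and `ToyIndexNode` are hypothetical stand-ins written for this illustration, not the LlamaIndex classes; the ids and texts are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyTextNode:
    node_id: str
    text: str


@dataclass
class ToyIndexNode(ToyTextNode):
    index_id: str  # id of the object this node points to


# One parent chunk and two child nodes that reference it.
parent = ToyTextNode(node_id="slide-29", text="Full text of the slide")
children = [
    ToyIndexNode(node_id="slide-29-b0", text="First bullet", index_id=parent.node_id),
    ToyIndexNode(node_id="slide-29-b1", text="Second bullet", index_id=parent.node_id),
]

# Lookup table from node id to node, used to follow index_id references.
nodes_by_id: Dict[str, ToyTextNode] = {parent.node_id: parent}

for child in children:
    target = nodes_by_id[child.index_id]
    print(f"{child.node_id} -> {target.node_id}")
```

Each child keeps its own short text for matching, while `index_id` leads back to the larger chunk that is eventually handed to the LLM.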

In [53]:
from llama_index.core.schema import IndexNode

sub_node_parsers = [bullet_node_parser]

all_nodes = []

for base_node in slides_nodes:
    for parser in sub_node_parsers:
        sub_nodes = parser.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original (parent) node to all_nodes
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)

## Embedding Model and LLM 

In [55]:
from llama_index.core.embeddings import resolve_embed_model

embed_model = resolve_embed_model("local:BAAI/bge-large-en-v1.5")

In [56]:
# Define a new prompt template
template = """Below is context that has been retrieved. Your task is to synthesize \
the query, which is delimited by triple backticks, and write a response that appropriately answers the query based on the retrieved context.

### Query:
```{query_str}```

### Response:

Begin!
"""

In [81]:
%%capture
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts import PromptTemplate

llm = HuggingFaceLLM(
    model_name="Deci/DeciLM-6b-instruct",
    tokenizer_name="Deci/DeciLM-6b-instruct",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    # query_wrapper_prompt=PromptTemplate(template),
    context_window=4096,
    max_new_tokens=512,
    device_map="auto",
    model_kwargs={'trust_remote_code':True},
    generate_kwargs={"do_sample": False},  # greedy decoding; temperature has no effect when sampling is off
    # generate_kwargs={"temperature": 0.0},
)

## Settings and VectorStoreIndex 

A VectorStoreIndex in LlamaIndex is an index that uses vector representations of text to retrieve relevant context efficiently. It takes the IndexNode objects as input and uses the specified embedding model to convert each node's text content into a vector. These vectors are then stored in the vector store.

For a given query, the VectorStoreIndex embeds the query with the same embedding model and retrieves the most relevant nodes from the vector store using nearest-neighbor search.
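
The nearest-neighbor step can be sketched as a brute-force search over a small in-memory store. This is a conceptual illustration only: the three-dimensional vectors are made up, `top_k` is a hypothetical helper, and cosine similarity is assumed as the metric.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float],
    store: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up "embeddings" keyed by node id.
store = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.8, 0.6, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

results = top_k(query, store, k=2)
for node_id, score in results:
    print(f"{node_id}: {score:.3f}")
```

Here `k` plays the same role as `similarity_top_k` in the retriever: it caps how many of the best-scoring nodes are returned.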

In [60]:

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

from llama_index.core import VectorStoreIndex

vector_index_chunk = VectorStoreIndex(all_nodes)

## Retriever object

A retriever is the component responsible for fetching relevant context from the index given a user query. Calling as_retriever on a VectorStoreIndex returns a VectorIndexRetriever object.

The RecursiveRetriever handles more complex retrieval tasks by following links between nodes: when a retrieved node is an IndexNode, it follows the node's index_id to the linked node, retriever, or query engine and continues from there. Here, a match on a child (bullet-level) node resolves to its parent chunk, so results from several child nodes are consolidated into the larger chunks they point to.
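
The child-to-parent resolution can be sketched as follows. This is a simplification under stated assumptions, not the RecursiveRetriever's actual logic: `resolve_to_parents` is a hypothetical helper, the ids and scores are made up, and keeping the best score per parent is one possible way to merge duplicate parents.

```python
from typing import Dict, List, Tuple


def resolve_to_parents(
    hits: List[Tuple[str, str, float]],
    parents: Dict[str, str],
) -> List[Tuple[str, float]]:
    """Map each child hit to its parent and keep the best score per parent."""
    best: Dict[str, float] = {}
    for _child_id, parent_id, score in hits:
        if parent_id not in parents:
            continue  # dangling reference: skip it
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    # Highest-scoring parents first.
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical parent chunks and child hits: (child_id, parent_id, score).
parents = {
    "slide-29": "Parent chunk for slide 29",
    "slide-54": "Parent chunk for slide 54",
}
hits = [
    ("bullet-a", "slide-54", 0.68),
    ("bullet-b", "slide-29", 0.61),
    ("bullet-c", "slide-54", 0.55),
]

resolved = resolve_to_parents(hits, parents)
for parent_id, score in resolved:
    print(parent_id, score)
```

Two of the three child hits point at the same parent, so only two parent chunks come back.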

In [65]:
from llama_index.core.retrievers import RecursiveRetriever


all_nodes_dict = {n.node_id: n for n in all_nodes}


vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

- The query can be either a plain string or a structured *QueryBundle* object.
- The *retrieve* method accepts the query, converts it into a *QueryBundle* if needed, and calls an internal method to fetch a list of nodes for it.
- Each node in the list carries a similarity score relative to the query.
- *display_source_node* accepts a *NodeWithScore* object and displays its node ID, its similarity score, and a truncated version of its content.
- The displayed text is truncated to *source_length* characters.

In [66]:
from llama_index.core.response.notebook_utils import display_source_node


nodes = retriever_chunk.retrieve(
    "What is FlashAttention?"
)
for node in nodes:
    display_source_node(node, source_length=1000)

Retrieving with query id None: What is FlashAttention?
Retrieved node with id, entering: 10e7daff-aebb-4d6c-aecf-55411b1a905d
Retrieving with query id 10e7daff-aebb-4d6c-aecf-55411b1a905d: What is FlashAttention?

**Node ID:** 10e7daff-aebb-4d6c-aecf-55411b1a905d

**Similarity:** 0.677945985196128

**Text:** ● FlashAttention introduces a signiﬁcant memory saving by making attention linear instead of quadratic in 
sequence length. FlashAttention-2 further improves computing the attention matrix by having fewer non-matmul 
FLOPS, better parallelism and better work partitioning. The result is a 2.8x training speedup of GPT-style models.
● Reducing the number of bits in the parameters reduces both the memory footprint and the latency of LLMs. The 
case for 4-bit precision: k-bit Inference Scaling Laws shows across a variety of LLMs that 4-bit quantisation is 
universally optimal for maximizing zero-shot accuracy and reducing the number of bits used.
● Speculative decoding enables decoding multiple tokens in parallel through multiple model heads rather than 
forward passes, speeding up inference by 2-3X for certain models.
● SWARM Parallelism is a training algorithm designed for poorly connected and unreliable devices.

## RetrieverQueryEngine 

- The RetrieverQueryEngine takes a retriever and a response synthesizer as inputs.
- The retriever is responsible for fetching relevant IndexNode objects from the index.
- The response synthesizer generates a natural language response based on the retrieved nodes and the user query.
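
The retrieve-then-synthesize flow can be sketched with stub components. This is a rough illustration, not the RetrieverQueryEngine's implementation: `pack_chunks` and `answer` are hypothetical helpers, the prompt wording is made up, and the packing step only approximates the idea behind the "compact" response mode (concatenating retrieved chunks so that fewer LLM calls are needed).

```python
from typing import Callable, List


def pack_chunks(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily concatenate chunks so each packed block stays within max_chars."""
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        # An oversized single chunk is still accepted as its own block.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


def answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    max_chars: int = 200,
) -> str:
    """Retrieve, pack, then call the LLM once per block, refining the answer."""
    response = ""
    for block in pack_chunks(retrieve(query), max_chars):
        prompt = f"Context:\n{block}\n\nPrevious answer: {response}\n\nQuery: {query}"
        response = llm(prompt)
    return response


# Stub retriever and LLM so the sketch runs without any model.
calls: List[str] = []


def fake_retrieve(query: str) -> List[str]:
    return ["a" * 80, "b" * 80, "c" * 80]


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer after {len(calls)} call(s)"


print(answer("What is FlashAttention?", fake_retrieve, fake_llm, max_chars=200))
```

With three 80-character chunks and a 200-character budget, the first two chunks share one block and the third gets its own, so the stub LLM is called twice instead of three times.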



In [77]:
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine


query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk,
    verbose=True,
    response_mode="compact"
)

## Query the indexed PDF

In [None]:
response = query_engine_chunk.query(
   "Who are the authors of this report?"
)
str(response)

In [None]:
response = query_engine_chunk.query(
   "What is new about FlashAttention?"
)
str(response)

In [None]:
response = query_engine_chunk.query(
    "Does the report mention anything about inference and latency concerns?"
)
str(response)

In [None]:
response = query_engine_chunk.query(
    "What does the report say about text to image models?"
)
str(response)

In [None]:
response = query_engine_chunk.query(
    "Summarize the research section of the report"
)
str(response)

In [None]:
response = query_engine_chunk.query(
    "What does the report say about the importance of quality prompts?"
)
str(response)

## References

1. https://docs.llamaindex.ai/en/stable/
2. https://deci.ai/blog/rag-with-llamaindex-and-decilm-a-step-by-step-tutorial/