# Build your first RAG system

This notebook builds a RAG system in the following stages:

* Data ingestion
* Indexing
* Retriever
* Response synthesizer
* Querying




# **Install the required packages**

In [1]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.12.3-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.0-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.3 (from llama-index)
  Downloading llama_index_core-0.12.3-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.3-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post4-py3-none-any.whl.metadata (8.5 kB)
Collecting 

# **Environment Variables**

In [None]:
import os
# from dotenv import load_dotenv
# load_dotenv()
# Retrieve the OpenAI API key from environment variables
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

In [15]:
# Or specify your key directly here (never commit a real key to version control)
os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"
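
Reading `os.environ['OPENAI_API_KEY']` directly raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows. The helper name `read_api_key` and its error message are hypothetical and are not part of LlamaIndex or the OpenAI client.

```python
import os


def read_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from an environment mapping, or raise a clear error.

    `read_api_key` is a hypothetical helper written for this notebook only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Demonstration with an explicit mapping, so no real key is read or printed.
print(read_api_key({"OPENAI_API_KEY": "  sk-example  "}))
```

Calling `read_api_key()` with no arguments reads from the real process environment.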

# **Download data**

In [6]:
!mkdir data
!wget "https://github.com/gmsahu/RAG-using-LlamaIndex-course/raw/33675a285b06b4048af6fa0221dd5fd289157a8a/transformers.pdf" -O 'data/transformers.pdf'


mkdir: cannot create directory ‘data’: File exists
--2024-12-07 07:35:23--  https://github.com/prashant9501/RAG-using-LlamaIndex-course/raw/33675a285b06b4048af6fa0221dd5fd289157a8a/transformers.pdf
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/prashant9501/RAG-using-LlamaIndex-course/33675a285b06b4048af6fa0221dd5fd289157a8a/transformers.pdf [following]
--2024-12-07 07:35:25--  https://raw.githubusercontent.com/prashant9501/RAG-using-LlamaIndex-course/33675a285b06b4048af6fa0221dd5fd289157a8a/transformers.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/octet-strea
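
The `mkdir` message in the output above appears because the `data` directory already existed from an earlier run. One possible way to make that step safe to re-run is to create the directory from Python with the standard library. The helper name `ensure_dir` is an assumption made for illustration.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory if it is missing and return it as a Path."""
    directory = Path(path)
    # exist_ok=True makes repeated calls harmless instead of raising an error.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


data_dir = ensure_dir("data")
ensure_dir("data")  # calling it a second time does nothing and raises nothing
print(data_dir.is_dir())
```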



# Stage 1: Data ingestion

* Data loaders

Start by loading the data from a PDF file. Here we use the `SimpleDirectoryReader` class from LlamaIndex.




In [7]:
from llama_index.core import SimpleDirectoryReader

In [8]:
documents = SimpleDirectoryReader(input_files=['data/transformers.pdf']).load_data()

In [7]:
len(documents)

15

In [9]:
print(documents[0])

Doc ID: 7ee1898b-4d80-4142-b43e-8879f26528bc
Text: Provided proper attribution is provided, Google hereby grants
permission to reproduce the tables and figures in this paper solely
for use in journalistic or scholarly works. Attention Is All You Need
Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google
Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com
Jakob Usz...


In [10]:
# Metadata of the first document
documents[0].metadata

{'page_label': '1',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-12-07',
 'last_modified_date': '2024-12-07'}
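
Each loaded document exposes a `text` attribute and a `metadata` dictionary, as the cells above show. The sketch below summarizes a list of such objects by page label and character count. It uses stand-in objects so that it runs without the PDF; `summarize_documents` is a hypothetical helper, and the sample texts are invented for the demonstration.

```python
from types import SimpleNamespace


def summarize_documents(docs):
    """Return a (page_label, character_count) pair for each document."""
    return [(doc.metadata.get("page_label"), len(doc.text)) for doc in docs]


# Stand-ins that mimic the two attributes used above (not real LlamaIndex objects).
sample_docs = [
    SimpleNamespace(text="Attention Is All You Need", metadata={"page_label": "1"}),
    SimpleNamespace(text="Model Architecture", metadata={"page_label": "2"}),
]

summary = summarize_documents(sample_docs)
print(summary)
```

In the notebook, the same function should accept the `documents` list directly, assuming each entry keeps its `text` and `metadata` attributes.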

In [11]:
print(documents[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp

# **Embedding Model**

*   Prepare the documents for embedding and for interaction with the large language model. Here we use `OpenAIEmbedding`.


In [16]:
# Embedding Model
from llama_index.embeddings.openai import OpenAIEmbedding

In [17]:
# Initialize the embedding model (alternative: "text-embedding-3-small")
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
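
An embedding model maps a piece of text to a fixed-length vector of numbers. The toy sketch below illustrates only that idea, using word counts over a tiny vocabulary. It is not how `OpenAIEmbedding` works internally (real models return dense, learned vectors); `toy_embed` and the vocabulary are assumptions made for illustration.

```python
def toy_embed(text, vocabulary):
    """Map text to a vector with one count per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


# Hypothetical three-term vocabulary, so every vector has length 3.
vocabulary = ["attention", "encoder", "decoder"]

vector = toy_embed("The encoder feeds the decoder", vocabulary)
print(vector)
```

Whatever the input length, the output always has `len(vocabulary)` entries, which is what makes vectors from different texts comparable.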

# Setup for LLM

In [18]:
# LLM
from llama_index.llms.openai import OpenAI

In [19]:
# Initialize the large language model
llm = OpenAI(model="gpt-3.5-turbo")

# Stage 2: Indexing

Use the `VectorStoreIndex` class to create an index from the loaded documents by passing the documents and the embedding model to the `from_documents` method.

In [20]:
# Indexing
from llama_index.core import VectorStoreIndex

In [21]:
# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
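
Before embedding, indexing splits each document into smaller chunks (nodes). The sketch below shows the general idea with a simplified word-based splitter that produces overlapping windows. It is an illustration only: LlamaIndex's own splitter is sentence-aware and counts tokens, and the name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten placeholder words, split into windows of 4 words with 1 word of overlap.
sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample_text, chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighboring chunks.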

# Stage 3: Retrieval
Retrieve relevant information based on queries.
The `as_retriever` method converts our index into a retriever, and the `retrieve` method lets us query the index.

In [22]:
# Setting up the Index as Retriever
retriever = index.as_retriever()

In [23]:
# Retrieve information based on the query "What is self attention?"
retrieved_nodes = retriever.retrieve("What is self attention?")
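
Conceptually, a vector retriever embeds the query, scores every stored chunk against it, and returns the best matches. The sketch below shows that ranking step with cosine similarity over toy two-dimensional vectors. The vectors, chunk IDs, and the helper `top_k` are invented for illustration and do not come from the notebook's index.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the IDs of the k chunks most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    )
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]


# Toy vectors standing in for chunk embeddings (assumed values).
chunk_vecs = {"page-4": [0.9, 0.1], "page-2": [0.1, 0.9], "page-5": [0.7, 0.3]}
best = top_k([1.0, 0.0], chunk_vecs, k=2)
print(best)
```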

In [24]:
# Get the metadata of the first retrieved node
retrieved_nodes[0].metadata

{'page_label': '4',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-12-07',
 'last_modified_date': '2024-12-07'}

In [25]:
# Access the ID of the first retrieved node
retrieved_nodes[0].id_

'2ec4f9dd-c2df-4bf8-a167-c6c7bf74570c'

In [26]:
# Access the full node object of the first retrieved node
retrieved_nodes[0].node

TextNode(id_='2ec4f9dd-c2df-4bf8-a167-c6c7bf74570c', embedding=None, metadata={'page_label': '4', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-12-07', 'last_modified_date': '2024-12-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='fa4cee33-cb26-4517-8bd5-84bb2f116c12', node_type='4', metadata={'page_label': '4', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-12-07', 'last_modified_date': '2024-12-07'}, hash='695d7e279d9eab238de3fd1c9c5d10ccff3e5ec28e46bb45b67266481cd08401')}, metadata_template='{key}: 

In [27]:
# Access the text content of the first retrieved node
print(retrieved_nodes[0].text)

Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V . We compute
the matrix of outputs as:
Attention(Q, K, V) = softmax(QKT
√dk
)V (1)
The two most commonly used attention functions are additive attention [2], a

# Stage 4: Response synthesis

To generate an answer from the retrieved nodes, use the `get_response_synthesizer` function:

In [28]:
from llama_index.core import get_response_synthesizer

In [29]:
# Initialize the response synthesizer with the LLM
response_synthesizer = get_response_synthesizer(llm=llm)
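
A response synthesizer takes the retrieved text chunks and the user's question and asks the LLM to answer using that context. The sketch below shows one plausible way to assemble such a prompt. The wording of the template and the name `build_prompt` are assumptions, not the template LlamaIndex actually uses.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and a question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short stand-in chunks (invented text, not taken from the index).
prompt = build_prompt(
    "What is self attention?",
    ["Chunk about scaled dot-product attention.", "Chunk about multi-head attention."],
)
print(prompt)
```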

# Stage 5: Query Engine
This engine will allow us to query our indexed documents and receive synthesized responses from the LLM.

In [30]:
# Create a query engine using the index, LLM, and response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

In [31]:
# Query the LLM using the query engine
response = query_engine.query("What is self attention?")

In [32]:
# View the response from the LLM
response.response

'Self-attention is a mechanism that connects all positions in a sequence with a constant number of sequentially executed operations. It allows the model to weigh the importance of different words in the input sentence when predicting a particular word.'

This returns the synthesized answer to the query.



In [33]:
# Check the length of the response
len(response.response) # number of characters in the response

251

In [34]:
# Check the number of source nodes
len(response.source_nodes)  # list of 2 nodes

2

In [35]:
# Access the metadata of the first source node
response.source_nodes[0].metadata

{'page_label': '4',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-12-07',
 'last_modified_date': '2024-12-07'}
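
The metadata on the source nodes can be turned into simple citations for an answer. The sketch below builds one label per distinct file and page, assuming each source node exposes a `metadata` dictionary with `file_name` and `page_label` keys as shown above. It uses stand-in objects, and `format_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def format_sources(source_nodes):
    """Return one 'file, page N' label per distinct source, in first-seen order."""
    labels = [
        f"{node.metadata['file_name']}, page {node.metadata['page_label']}"
        for node in source_nodes
    ]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(labels))


# Stand-ins that mimic only the `metadata` attribute of a source node.
sample_nodes = [
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "4"}),
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "2"}),
    SimpleNamespace(metadata={"file_name": "transformers.pdf", "page_label": "4"}),
]

citations = format_sources(sample_nodes)
print(citations)
```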

# RAG pipeline
In this final section, we integrate everything covered so far into a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline. This pipeline reads documents, indexes them, and lets us query the indexed data using a large language model (LLM).

Let's walk through the entire process step by step:

- First, we import the necessary classes and load our documents. The `SimpleDirectoryReader` class from LlamaIndex reads all documents in the 'data' directory and stores them in the `documents` variable.

- Next, we need a large language model (LLM) and an embedding model. For this demonstration, we assume that both were initialized in the earlier cells and are available as `llm` and `embed_model`.

- With our documents and models ready, we create an index, which enables efficient retrieval of information from our documents. Here, we use the `VectorStoreIndex` class to build the index from the loaded documents, embedding each chunk with the embedding model.

- We then set up a query engine that lets us query the indexed documents in natural language. The query engine is created from our index and the LLM.

- Finally, we use the query engine to ask a question. In this example, we ask about the different types of Transformer models.

- The `query` method retrieves the relevant chunks from the index and passes them, together with the question, to the LLM, which synthesizes a response. The response is then printed to the console.




In [36]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load data from the specified directory
documents = SimpleDirectoryReader("data").load_data()

# `llm` and `embed_model` are assumed to be initialized in the earlier cells

# Create an index from the documents using the embedding model and LLM
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, llm=llm)

# Create a query engine from the index and LLM
query_engine = index.as_query_engine(llm=llm)

# Query the LLM and print the response
print(query_engine.query("What are the different types of Transformer Models?").response)

The different types of Transformer models include the Encoder and Decoder stacks, which consist of multiple identical layers with sub-layers for self-attention and feed-forward networks. Additionally, there are variations like the Decoder stack with an added sub-layer for multi-head attention over the output of the Encoder stack.


In [37]:
query = """If I want to generate document embeddings,
then which type of Transformer Architecture I must choose among Encoders, Decoders or Encoder-Decoder?"""

print(query_engine.query(query).response)

Encoder


In [38]:
query = "If I want to generate document embeddings, then which type of Transformer Architecture I must choose?"
print(query_engine.query(query).response)

If you want to generate document embeddings, you should choose the Encoder part of the Transformer architecture.
