# Retrieval Augmented Generation (RAG) with LlamaIndex
*Using IBM Granite Models*

## Recipe Overview

Welcome to this Granite Recipe!

In this notebook you will learn to implement Retrieval Augmented Generation (RAG) using the LlamaIndex orchestration framework.

RAG is an architecture that improves the performance of language models by connecting them to knowledge bases. The language model can then draw on factual information from the knowledge base and tailor it to the user query.

The major components of RAG architecture are:
1. Knowledge Base - Data repository for the system
2. Retriever - A language model that gathers context from the knowledge base that is relevant to the user query
3. Generator - A language model that generates a response to the augmented query, which combines the user query with the context identified by the retriever
4. Integration Layer - A layer that coordinates the other components and brings their functionality together
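
As a rough illustration of how these four components fit together, here is a minimal sketch in plain Python. It is not how LlamaIndex or Granite work internally: the word-overlap retriever stands in for an embedding model, the generator is a stub in place of an LLM call, and every name (`KNOWLEDGE_BASE`, `retrieve`, `generate`, `answer`) and sample passage is hypothetical.

```python
# Toy illustration of the four RAG components; every name here is hypothetical.

# 1. Knowledge base: a plain list of text passages.
KNOWLEDGE_BASE = [
    "The warranty covers battery defects for two years.",
    "Returns are accepted within thirty days of purchase.",
    "Shipping is free for orders above fifty dollars.",
]


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """2. Retriever: rank passages by word overlap with the query (a stand-in for embeddings)."""
    query_tokens = _tokens(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(query_tokens & _tokens(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def generate(augmented_query: str) -> str:
    """3. Generator: a stub that echoes its prompt; a real system would call an LLM here."""
    return f"[LLM answer based on]\n{augmented_query}"


def answer(query: str) -> str:
    """4. Integration layer: retrieve context, augment the query, call the generator."""
    context = "\n".join(retrieve(query))
    augmented_query = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(augmented_query)


print(answer("How long does the warranty cover the battery?"))
```

The rest of this recipe replaces each stub with a real component: ChromaDB as the knowledge base, a Granite embedding model for retrieval, a Granite LLM for generation, and LlamaIndex as the integration layer.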

Advantages of the RAG architecture include access to domain-specific information, cost-efficient AI implementation and scaling, reduced risk of hallucinations, and greater data security. Some use cases of RAG are:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Specialized chatbot: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

[![Open YouTube video](https://img.youtube.com/vi/T-D1OfcDW1M/0.jpg)](https://www.youtube.com/watch?v=T-D1OfcDW1M)

## Environment Setup

### Install dependencies

In [None]:
%pip install git+https://github.com/ibm-granite-community/utils \
    transformers \
    llama-index \
    llama-index-embeddings-huggingface \
    llama-index-vector-stores-chroma \
    wget \
    chromadb \
    llama-index-llms-replicate \
    replicate

## System Components Configuration

### Embedding Model Selection (Retriever)

Select the embedding model and the tokenizer for the architecture. The embedding model generates vector representations of the user query and knowledge base, enabling retrieval of semantically relevant context.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbedding(
    model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
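
To make "semantically relevant" more concrete, the sketch below computes cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are made-up stand-ins for real embeddings, and the metric actually used in a given setup depends on how the vector store is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
related_vec = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, related_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.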

### LLM Selection (Generator)

The LLM is the generator component that answers the user query, given the retrieved context. For this recipe, we connect to the Granite 3.3 8B model using the LlamaIndex Replicate client.

You can select other Granite models from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. 

In [None]:
from llama_index.llms.replicate import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model_path = "ibm-granite/granite-3.3-8b-instruct"
get_env_var('REPLICATE_API_TOKEN')

model = Replicate(
    model=model_path
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

### Global Settings

Setting these parameters globally ensures that the same LLM, embedding model, and chunk size are used consistently across the notebook.

In [None]:
from llama_index.core import Settings

Settings.llm = model
Settings.embed_model = embeddings_model
Settings.chunk_size = embeddings_tokenizer.max_len_single_sentence

### Vector Database Selection (Knowledge Base)

Identify the database to store and retrieve embedding vectors.
In this recipe, we select ChromaDB to store our knowledge base. Storing the knowledge base as vectors enables efficient similarity computation and retrieval of relevant context.

In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("granite_rag_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

## Building the Vector Database

In this recipe, we take the State of the Union speech text, split it into chunks, derive embedding vectors using the embedding model, and load it into the vector database for querying.

### Download the document

Here we use President Biden's State of the Union address from March 1, 2022.

In [None]:
import os
import wget

filename = 'state_of_the_union.txt'
url = 'https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
    wget.download(url, out=filename)

### Split the document into chunks

Split the document into text segments that fit within the embedding model's maximum input length.

Note that the chunk size was set to the embedding model's maximum single-sequence length in the Global Settings section. It is passed to `SentenceSplitter` explicitly here so that the splitter does not fall back to its own default chunk size.

In [None]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=[filename]).load_data()
sentence_splitter = SentenceSplitter(chunk_size=Settings.chunk_size, chunk_overlap=0)

nodes = sentence_splitter.get_nodes_from_documents(documents)

for idx, node in enumerate(nodes):
    node.metadata["doc_id"] = idx

print(f"{len(nodes)} text document chunks created")
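
The idea behind sentence-aware chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not the `SentenceSplitter` implementation: it assumes that whitespace-separated words approximate model tokens and that sentences end with `.`, `!`, or `?`. The function name `split_into_chunks` is hypothetical.

```python
import re


def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    Whitespace-separated words stand in for model tokens here (an assumption);
    a sentence longer than max_tokens becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        if not sentence:
            continue  # skip the empty string produced by empty input
        n_tokens = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
for chunk in split_into_chunks(sample, max_tokens=5):
    print(repr(chunk))
```

Keeping sentences whole means each chunk remains readable on its own, at the cost of chunks that are often somewhat shorter than the budget.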

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=embeddings_model,
    show_progress=True
)

## Retrieval from Knowledge Base

### Conduct a similarity search

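Search the knowledge base for relevant documents by comparing the embedding vector of the query with the embedding vectors of the document chunks. The chunks whose vectors lie closest to the query in the vector space are returned; here we request the top three.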

In [None]:
query = "What did the president say about Fortune 500 Corporations?"

retriever = index.as_retriever(similarity_top_k=3)
retrieval_results = retriever.retrieve(query)
print(f"{len(retrieval_results)} documents returned")
for i, node in enumerate(retrieval_results):
    print(f"\nDocument {i+1} :")
    print(f"\nDocument ID : {node.metadata['doc_id']}")
    print(f"\nScore {i+1} : {node.score:.2f}")
    print(f"\nText:\n {node.text}")
    print("=" * 80)
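
The ranking step behind a top-k similarity search can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not what ChromaDB does internally: the two-dimensional vectors are invented, and the helper names `normalize` and `top_k` are hypothetical. Vectors are scaled to unit length first, so a plain dot product serves as the similarity score.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    q = normalize(query_vec)
    scores = {
        doc_id: sum(a * b for a, b in zip(q, normalize(vec)))
        for doc_id, vec in doc_vecs.items()
    }
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical 2-dimensional document embeddings.
docs = {"doc_a": [1.0, 0.0], "doc_b": [1.0, 1.0], "doc_c": [0.0, 1.0]}
for doc_id, score in top_k([1.0, 0.2], docs, k=2):
    print(f"{doc_id}: {score:.2f}")
```

A production vector database typically replaces this exhaustive scan with an approximate nearest-neighbor index so that queries stay fast as the collection grows.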

## Response Generation

### Custom RAG Query Engine

This section outlines the process of building a custom LlamaIndex query engine for RAG using Granite models. The custom query engine operates in the following steps:

1. **Document Retrieval** - The retriever identifies and fetches relevant documents based on the input query.

2. **Prompt Construction** - The Granite chat template formats the original query and the retrieved documents into a single prompt.

3. **Response Generation** - An LLM (Granite in this recipe) generates a response by processing the formatted prompt, utilizing the context from the retrieved documents.

The engine returns a `Response` object that holds the LLM's generated output in `response` and the retrieved documents in `source_nodes`.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from transformers import PreTrainedTokenizerBase
from llama_index.core.llms import LLM
from llama_index.core.base.response.schema import Response

class RAGGraniteQueryEngine(CustomQueryEngine):
    retriever: BaseRetriever
    llm: LLM
    tokenizer: PreTrainedTokenizerBase

    def custom_query(self, query_str: str):
        docs = self.retriever.retrieve(query_str)

        formatted_prompt = self.tokenizer.apply_chat_template(
            conversation=[{
                "role": "user",
                "content": query_str
            }],
            documents=[{
                "doc_id": node.metadata.get("doc_id", ""),
                "text": node.text,
            } for node in docs],
            add_generation_prompt=True,
            tokenize=False
        )

        llm_response = self.llm.complete(formatted_prompt)
        return Response(response=llm_response.text, source_nodes=docs)



retriever = index.as_retriever(similarity_top_k=3)

query_engine = RAGGraniteQueryEngine(
    retriever=retriever,
    llm=model,
    tokenizer=tokenizer
)
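
To show what the prompt construction step does conceptually, here is a plain-Python sketch that merges a query and retrieved documents into one string. The layout is hypothetical and for illustration only: the Granite chat template applied by `apply_chat_template` defines its own special tokens and document formatting, and `build_rag_prompt` is not part of any library.

```python
def build_rag_prompt(query: str, documents: list[dict]) -> str:
    """Assemble a plain-text prompt from retrieved documents and the user query.

    Hypothetical layout for illustration only; the Granite chat template
    defines its own special tokens and document formatting.
    """
    doc_lines = [f"[Document {doc['doc_id']}]\n{doc['text']}" for doc in documents]
    context = "\n\n".join(doc_lines)
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Placeholder documents shaped like the dictionaries passed to the chat template.
retrieved = [
    {"doc_id": 0, "text": "Placeholder passage about the first topic."},
    {"doc_id": 1, "text": "Placeholder passage about the second topic."},
]
print(build_rag_prompt("What does the first passage cover?", retrieved))
```

Using the tokenizer's chat template instead of a hand-written layout keeps the prompt in the format the model was trained on.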

### Query Engine Execution

The query is submitted to the query engine, and the result is captured. It contains both the answer generated by the LLM and the documents retrieved for the query.

In [None]:
from ibm_granite_community.notebook_utils import wrap_text

query = "What was said about Ketanji Brown Jackson's nomination to the Supreme Court?"
answer = query_engine.query(query)

print("=== RAG Response ===")
print(wrap_text(answer.response))

The code cell below lists the source documents that were identified as relevant context.

In [None]:
print("\n\n=== Source Documents ===")
for i, source_node in enumerate(answer.source_nodes):
    doc_id = source_node.metadata.get('doc_id', 'N/A')
    print(f"\nDocument {i+1} :")
    print(f"\nDocument ID : {doc_id}")
    print(f"\nScore {i+1} : {source_node.score:.2f}")
    print(f"\nText:\n {source_node.text}")
    print("=" * 80)

### Queries outside the scope of Knowledge Base

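When a query falls outside the scope of the knowledge base, the retrieved documents do not contain the answer, and the model is expected to decline rather than answer from its own general knowledge.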

In [None]:
query = "When was the last time Ferrari won the Formula 1 World Championship?"
answer = query_engine.query(query)

print("=== RAG Response ===")
print(wrap_text(answer.response))
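
The recipe relies on the model and its prompt template to handle out-of-scope queries. As an optional extra safeguard that is not part of this recipe, an application could also skip generation when no retrieved document scores high enough. The sketch below shows the idea with hypothetical names (`answer_or_decline`, `fake_generate`, `FALLBACK`) and an arbitrary threshold; a sensible value for `min_score` depends on the embedding model and similarity metric.

```python
from typing import Callable

FALLBACK = "I could not find this in the knowledge base."


def answer_or_decline(
    query: str,
    scored_docs: list[tuple[str, float]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.5,
) -> str:
    """Call the generator only when at least one document clears min_score.

    min_score is a hypothetical threshold; a suitable value depends on the
    embedding model and the similarity metric in use.
    """
    relevant = [text for text, score in scored_docs if score >= min_score]
    if not relevant:
        return FALLBACK
    return generate(query, relevant)


def fake_generate(query: str, docs: list[str]) -> str:
    """Stand-in for an LLM call that reports how many documents it received."""
    return f"Answered '{query}' using {len(docs)} document(s)."


print(answer_or_decline("in scope", [("doc text", 0.8), ("other", 0.3)], fake_generate))
print(answer_or_decline("out of scope", [("doc text", 0.2)], fake_generate))
```

A guard like this also avoids spending an LLM call on queries that the knowledge base cannot support.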

## Conclusion

In conclusion, this recipe demonstrated a simple RAG architecture using the LlamaIndex orchestration layer and a knowledge base stored in ChromaDB. We used the LlamaIndex Replicate client for the Granite language model, the LlamaIndex HuggingFace embedding client for the Granite embedding model, and the Hugging Face `transformers` tokenizers for chunk sizing and prompt formatting. We also built a custom query engine that uses the Granite chat template to generate responses for RAG queries.

For more recipes on RAG architectures, see the [RAG recipes](https://github.com/ibm-granite-community/granite-snack-cookbook/tree/main/recipes/RAG) in the Granite Snack Cookbook. You can also learn more about Agentic RAG in this [recipe](https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/AI-Agents/Agentic_RAG.ipynb).



## References
1. "What is retrieval-augmented generation?". 2023. IBM Research Blog. https://research.ibm.com/blog/retrieval-augmented-generation-RAG.
2. "Basic Chat Template Examples". 2025. IBM Granite Documentation. https://www.ibm.com/granite/docs/models/granite/#basic-chat-template-example.