In [None]:
from IPython.display import Image

# RAG (Retrieval-Augmented Generation)

### **1. What is RAG?**

LLMs are trained on data up to a cutoff date, so if you ask about more recent events, they won't be able to answer. We therefore need a solution that takes the question (prompt) and looks for the answer in other resources such as documents, emails, databases, or search engines like Google. This process of retrieving information from other sources and passing it back to the LLM to generate an answer is called RAG.

The benefit of using RAG goes beyond helping the model tackle **hallucination**: it also helps the information **remain up-to-date**, which is very important in many tasks.

### **2. Where is RAG in LLM Workflow?**
Here I am assuming that we are using the Azure AI Search and Azure OpenAI services to work with the LLM. Of course, you can build your own custom RAG, but the diagram would still be valid: it shows how data sources are used in any RAG-based scenario.


In [None]:
Image(url="https://learn.microsoft.com/en-us/azure/search/media/retrieval-augmented-generation-overview/architecture-diagram.png")

*   App UX (web app) for the user experience
*   App server or orchestrator (integration and coordination layer)
*   Azure AI Search (information retrieval system)
*   Azure OpenAI (LLM for generative AI)


### **3. How does RAG work?**

Naturally, the next question is: how does RAG retrieve answers? To answer that, we need two more concepts, **Index** and **Embedding Vectors**, because these are the two core concepts of any RAG.

Let's start with Index:
RAG uses your data to generate answers to the user's question. For RAG to work well, we need a way to ***search*** your data and send it to the LLM in an easy and cost-efficient manner. This is achieved by using an index.

**An index** is a data store that allows you to search data efficiently. This index is very useful in RAG. An index can be optimized for LLMs by creating **vectors** (text data converted to number sequences using an embedding model). A good index usually has efficient search capabilities like keyword searches, semantic searches, vector searches, or a combination of these. This optimized RAG pattern can be illustrated as follows.

In [None]:
Image(url="https://learn.microsoft.com/en-us/azure/ai-foundry/media/index-retrieve/rag-pattern.png")

In [None]:
Image(url="https://learn.microsoft.com/en-us/azure/ai-foundry/media/index-retrieve/rag-pattern-with-index.png")

Some services, such as **Azure AI**, provide an **index** asset to use with the RAG pattern. The index asset contains important information such as where your index is stored, how to access it, which search modes it supports, whether it has vectors, and which embedding model was used for those vectors.
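To make the idea of a vector index concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up by hand for illustration (a real embedding model produces hundreds of dimensions), and the `search` helper is a hypothetical name, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": text mapped to a hand-written vector (assumption, not real embeddings).
index = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Quarterly revenue report": [0.0, 0.2, 0.9],
    "Changing account credentials": [0.8, 0.3, 0.1],
}

def search(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Pretend this is the embedding of a question about logging in.
query_vector = [1.0, 0.1, 0.0]
top = search(query_vector, index)
print(top)
```

The two account-related entries come back first even though they share few words, which is the point of searching by vector rather than by keyword.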

**A Standard RAG**

A standard RAG works like this:

1. First, it breaks down the knowledge base (the “corpus” of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Then, it uses an embedding model to convert these chunks into vector embeddings that encode meaning;
3. and finally, it stores these embeddings in a vector database that allows for searching by semantic similarity.

At runtime, when a user inputs a query to the model, the **vector database** is used to find the most relevant chunks based on semantic similarity to the query. Then, the most relevant chunks are added to the prompt sent to the generative model.

While embedding models excel at capturing semantic relationships, they can miss crucial exact matches. Fortunately, there’s an older technique that can assist in these situations. **BM25 (Best Matching 25)** is a ranking function that uses lexical matching to find precise word or phrase matches. It's particularly effective for queries that include unique identifiers or technical terms.

BM25 works by building upon the **TF-IDF (Term Frequency-Inverse Document Frequency)** concept. TF-IDF measures how important a word is to a document in a collection. BM25 refines this by considering document length and applying a saturation function to term frequency, which helps prevent common words from dominating the results.

Here’s how BM25 can succeed where semantic embeddings fail: Suppose a user queries "Error code TS-999" in a technical support database. An embedding model might find content about error codes in general, but could miss the exact "TS-999" match. BM25 looks for this specific text string to identify the relevant documentation.
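The scoring idea can be sketched in a few lines. This is a simplified take on the common Okapi BM25 formula, with assumed defaults `k1=1.5` and `b=0.75`, a naive regex tokenizer, and three made-up support documents; production systems use a tuned library implementation instead.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep letters, digits, and hyphens so 'TS-999' stays one token."""
    return re.findall(r"[a-z0-9-]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a high IDF; terms found everywhere get a low one.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Saturating term frequency, normalized by document length.
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical documents for the "TS-999" example.
docs = [
    "Error code TS-999 means the sensor timed out",
    "General guide to error codes and troubleshooting",
    "Error code TS-100 indicates low battery",
]
scores = bm25_scores("Error code TS-999", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because "ts-999" appears in only one document, its IDF is high and that document wins, while the word "error", which appears everywhere, contributes very little.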

RAG solutions can more accurately retrieve the most applicable chunks by combining the embeddings and BM25 techniques using the following steps:

**A Standard RAG using BM25**

1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Create TF-IDF encodings and semantic embeddings for these chunks;
3. Use BM25 to find top chunks based on exact matches;
4. Use embeddings to find top chunks based on semantic similarity;
5. Combine and deduplicate results from (3) and (4) using rank fusion techniques;
6. Add the top-K chunks to the prompt to generate the response.

By leveraging both BM25 and embedding models, traditional RAG systems can provide more comprehensive and accurate results, balancing precise term matching with broader semantic understanding.
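The "combine and deduplicate" step can be done in several ways. As one illustration, here is a sketch of reciprocal rank fusion (RRF), a common rank fusion technique; the constant `k=60` is a conventional default, and the chunk IDs and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk earns 1 / (k + rank) from every list it appears in, so chunks
    ranked well by both retrievers rise to the top, and duplicates are merged
    automatically because scores are keyed by chunk ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical outputs of the BM25 and embedding retrievers, best match first.
bm25_ranking = ["chunk_7", "chunk_2", "chunk_9"]
embedding_ranking = ["chunk_2", "chunk_5", "chunk_7"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

RRF only needs ranks, not raw scores, which is convenient here because BM25 scores and cosine similarities are not on the same scale.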

In [None]:
Image(url="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F45603646e979c62349ce27744a940abf30200d57-3840x2160.png&w=3840&q=75", width=1000)

This approach allows you to cost-effectively scale to enormous knowledge bases, far beyond what could fit in a single prompt. But these traditional RAG systems have a significant limitation: **they often destroy context**.


**Example of RAG failing because of lack of Context:**

Imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"

A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.

### **4. Optimizing RAG**

1. Sometimes the simplest solution is the best. If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can just include the entire knowledge base in the prompt that you give the model, with no need for RAG or similar methods. **I think Llama 4 is doing it now!**

2. **Prompt caching** makes this full-prompt approach significantly faster and more cost-effective, because the model does not have to reprocess the same knowledge base on every request.

3. As your knowledge base grows, you'll need a more scalable solution. That’s where **Contextual Retrieval** comes in.
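A minimal sketch of the idea behind Contextual Retrieval, as described in the Anthropic article listed under Resources: before a chunk is embedded or indexed, a short chunk-specific context is prepended to it. In the article the context is written by an LLM; here `stub_generate_context` is a hypothetical stand-in so the example runs without any model, and the document title is made up to match the ACME example above.

```python
def contextualize_chunk(chunk, document_title, generate_context):
    """Prepend a short situating context to a chunk before it is indexed."""
    context = generate_context(document_title, chunk)
    return f"{context} {chunk}"

def stub_generate_context(document_title, chunk):
    # Stand-in for an LLM call (assumption): a real implementation would ask a
    # model to describe where this chunk sits within the whole document.
    return f"This chunk is from {document_title}."

chunk = "The company's revenue grew by 3% over the previous quarter."
contextualized = contextualize_chunk(
    chunk,
    "ACME Corp's Q2 2023 SEC filing",  # hypothetical document title
    stub_generate_context,
)
print(contextualized)
```

The contextualized text, not the bare chunk, is what gets embedded and added to the BM25 index, so a query mentioning "ACME" and "Q2 2023" now has something to match.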

### **5. How to implement RAGs?**

Common frameworks for implementing RAG are **LangChain**, **LlamaIndex**, and **Semantic Kernel**.

**LangChain**

LangChain is a framework for developing applications powered by large language models (LLMs). LangChain implements a standard interface for large language models and related technologies, such as embedding models and vector stores, and integrates with hundreds of providers.

There are five components in LangChain: **Models**, **Prompts**, **Indexes**, **Chains**, and **Agents**.

In [None]:
Image(url="https://python.langchain.com/svg/langchain_stack_112024.svg", width=1000)

In [5]:
# Download a quantized Mistral model (Q4_K_M)
!wget -O mistral-7b-instruct.Q4_K_M.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf



In [4]:
!pip install -q langchain langchain-community langchain_huggingface langchainhub llama-cpp-python chromadb sentence-transformers

In [6]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA

In [8]:
# Make sure to upload your file on Google Colab before running this code!
loader = TextLoader("cleaned_example.txt")
documents = loader.load()

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
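To see what `chunk_size=500` and `chunk_overlap=50` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `chunk_text` helper does not do.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# A dummy 1200-character document made of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)

print([len(c) for c in chunks])
# The last 50 characters of one chunk are repeated at the start of the next.
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that falls on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.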

In [10]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(docs, embedding_model)


In [11]:
# k specifies the number of closest (most relevant) documents to return
retriever = db.as_retriever(search_kwargs={"k": 3})

In [12]:
llm = LlamaCpp(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # Update this path to your downloaded model
    n_ctx=2048,
    temperature=0.7,
    top_p=0.9,
    verbose=True,
    n_threads=4  # Set depending on your CPU
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from mistral-7b-instruct.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention

In [13]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
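As far as I know, `RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which simply pastes all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python; `build_stuffed_prompt` is a hypothetical helper and the template wording is illustrative, not LangChain's exact prompt.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Put all retrieved chunks into one prompt, followed by the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Hypothetical chunks standing in for the k=3 documents the retriever returns.
retrieved = [
    "The path was dark, but the traveler kept walking.",
    "She carried both hope and sorrow.",
    "Something powerful watched from the distance.",
]
prompt = build_stuffed_prompt("Who is watching the traveler?", retrieved)
print(prompt)
```

Since every chunk goes into one prompt, `k` times the chunk size plus the question must fit within the model's context window (`n_ctx=2048` tokens in the cell above).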

In [14]:
query = "Summarize the story in 500 words and write in Simple English."
result = qa_chain.invoke(query)

llama_perf_context_print:        load time =  194461.98 ms
llama_perf_context_print: prompt eval time =  194458.91 ms /   423 tokens (  459.71 ms per token,     2.18 tokens per second)
llama_perf_context_print:        eval time =   40233.57 ms /    67 runs   (  600.50 ms per token,     1.67 tokens per second)
llama_perf_context_print:       total time =  234775.22 ms /   490 tokens


In [15]:
print("\n=== Answer ===\n")
print(result["result"])


=== Answer ===

 The story is about a person who is on a journey to find something. The path is dark and the traveler is carrying both hope and sorrow. Along the way, the traveler's footsteps are echoing through time, drawing closer to the truth that is buried beneath silence. Something powerful is watching from the distance.


In [16]:
print("\n=== Sources ===\n")
for doc in result["source_documents"]:
    print(doc.metadata)


=== Sources ===

{'source': 'cleaned_example.txt'}
{'source': 'cleaned_example.txt'}
{'source': 'cleaned_example.txt'}


### **Resources**

1. https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
2. https://www.anthropic.com/news/contextual-retrieval
3. https://python.langchain.com/docs/introduction/
4. https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/
5. https://www.deeplearning.ai/short-courses/langchain-chat-with-your-data/
6. https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/
7. https://www.udemy.com/course/langchain/?couponCode=24T1MT310325G2


