# Your first RAG application (Retrieval-Augmented Generation)

In this notebook, you'll build a simple **RAG (Retrieval-Augmented Generation)** pipeline, a technique that combines large language models with external knowledge sources.

RAG lets the model **retrieve** relevant information from documents and **generate** **grounded** responses, instead of relying only on the LLM's internal knowledge.

### 🔍  Why RAG?
- Enhances model responses with up-to-date or domain-specific knowledge.
- Reduces hallucinations by grounding answers in real data.

### Install the dependencies: vector database and model server

In [None]:
# Install the vector DB. If the runtime has to be restarted afterwards, re-running this cell only confirms the package is already installed.
!pip install chromadb

In [None]:
# Install ollama to serve the models 
!curl -fsSL https://ollama.ai/install.sh | sh
!ollama --version

In [None]:
# Start ollama serve in the background using nohup and &
!nohup ollama serve > /dev/null 2>&1 &

In [None]:
# Obtain your embedding & LLM models (execute here or go to terminal)
!ollama pull mxbai-embed-large
!ollama pull mistral  

In [None]:
# Check model availability 
!ollama list

In [None]:
# Finally, install the python library
!pip install ollama

### Import libs and start running

In [2]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import rich
from IPython.display import Image, display

import ollama 

In [None]:
# First run 
res = ollama.chat(model="mistral", 
            messages=[ {"role": "user", "content": "Tell me a joke about Data Science"}]
           )
rich.print(res)

# Build a naive RAG pipeline

A naive RAG pipeline has five main components:

1. An open-source embedding function to turn text into vectors
2. ChromaDB to store the documents and their embeddings (the index)
3. Dense retrieval (semantic search) over the index to fetch the most relevant content
4. Augmentation of the query with the retrieved context
5. Generation of an answer to the query, using the retrieved context
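
The five steps can be sketched end to end without any external service. This is a minimal illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "index" is a plain Python list rather than ChromaDB, and the generation step is left out.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-2: embed the documents and keep them in a simple in-memory "index"
docs = [
    "Chroma stores documents together with their embeddings.",
    "An apple a day keeps the doctor away.",
]
index = [(doc, toy_embed(doc)) for doc in docs]

# Step 3: retrieval, rank documents by vector similarity to the query
# (the real pipeline uses dense vectors from the embedding model)
query = "Where are embeddings stored?"
q_vec = toy_embed(query)
ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
top_doc = ranked[0][0]

# Step 4: augment the query with the retrieved context
prompt = f"Context:\n{top_doc}\n\nQuestion: {query}\nAnswer:"

# Step 5: generation would send `prompt` to the LLM (omitted in this sketch)
print(prompt)
```

The rest of the notebook follows the same structure, with a real embedding model, ChromaDB as the index, and an LLM for the final step.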

In [6]:
url = "https://raw.githubusercontent.com/marc-olm/genai101/main/docs/images/rag_diagram.png"
display(Image(url=url, width=600))

## Set up your first RAG pipeline

The index is built from the (chunked) documents and the embedding function: each chunk is embedded and stored together with its vector.

In [7]:
url = "https://raw.githubusercontent.com/marc-olm/genai101/main/docs/images/indexing.png"
display(Image(url=url, width=700))

In [None]:
import os
import chromadb

In [None]:
# === Step 1: Set up ChromaDB (in-memory client; data is lost when the session ends) ===
chroma_client = chromadb.Client()
collection    = chroma_client.get_or_create_collection(name="rag-docs")

In [None]:
# === Step 2: Load and Embed Documents ===
def embed_text(text):
    """Return the embedding vector of `text` from the local Ollama embedding model."""
    response = ollama.embed(model="mxbai-embed-large", input=text)
    return response["embeddings"][0]

In [None]:
# Sample docs (could also read from files)
documents = [
    "Jurgen Klopp was born in Germany in 1967. He has been a successful coach in the UK",
    "You can contact Sky customer support through the help portal or live chat.",
    "An apple a day keeps the doctor away"
]

In [None]:
for i, doc in enumerate(tqdm(documents)):
    embedding = embed_text(doc)
    collection.add(
        documents=[doc],
        embeddings=[embedding],
        ids=[f"doc-{i}"],
    )

In [None]:
# === Step 3: Accept User Query and Retrieve Relevant Docs ===
query = "What team did Jurgen Klopp coach?"

query_embedding = embed_text(query)
# With only three sample documents, n_results=3 returns all of them, ranked by similarity
results = collection.query(query_embeddings=[query_embedding], n_results=3)

rich.print(results) 

In [None]:
retrieved_docs = results["documents"][0]
context = "\n".join(retrieved_docs)
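
The `[0]` index is needed because the query result holds one inner list per query embedding. The sketch below uses a hand-written dictionary with hypothetical values to mimic that nested layout; a real result contains further keys, and the exact set may vary with the ChromaDB version.

```python
# Hypothetical, hand-written result mimicking the nested-list layout of a query result
results = {
    "ids": [["doc-0", "doc-2", "doc-1"]],
    "documents": [["first text", "third text", "second text"]],
    "distances": [[0.41, 0.97, 1.12]],
}

# Index 0 selects the hits for the first (and here only) query
hits = list(zip(results["ids"][0], results["documents"][0], results["distances"][0]))
for doc_id, doc, dist in hits:
    print(f"{doc_id}  distance={dist:.2f}  {doc}")

# Hits are assumed to be ordered from closest to farthest
best_id, best_doc, best_dist = hits[0]
```

Keeping the ids and distances next to the text makes it easier to inspect why a given document ended up in the context.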

In [None]:
# === Step 4: Run RAG Prompt through Ollama LLM ===
answer_prompt = """You're a personal assistant. Your task is to answer questions using only the provided context.
If you cannot explicitly extract the answer from the context, your answer must be "I cannot help with that".

Your answer must be direct and contain no more than 150 words.

Question: {query}

<context start> 
{context}
</context end>

Answer:"""

rich.print(answer_prompt)

In [None]:
query = 'What team did Jurgen Klopp coach?'

In [None]:
res = ollama.chat(model="mistral", 
            messages=[ {"role": "user", "content": answer_prompt.format(context=context, query=query)}]
           )
rich.print(res)

## Build a proper index 

1. Take large documents and chunk them if needed
2. Add relevant metadata to the documents to enhance search
3. Embed and add to the collection 
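
Step 2 could look like the sketch below, which pairs each chunk with an id and a small metadata dictionary. The helper `build_records` and the field names (`source`, `chunk_index`, `start_char`) are hypothetical, and the character offsets assume fixed-size chunks that advance by `chunk_size - overlap`.

```python
def build_records(chunks, source, chunk_size, overlap):
    """Pair each chunk with an id and a metadata dict (hypothetical field names)."""
    step = chunk_size - overlap  # distance between the starts of consecutive chunks
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": str(i),
            "document": chunk,
            "metadata": {"source": source, "chunk_index": i, "start_char": i * step},
        })
    return records

# "abcdefghijkl" split with chunk_size=6 and overlap=2
demo_chunks = ["abcdef", "efghij", "ijkl"]
records = build_records(demo_chunks, source="demo.txt", chunk_size=6, overlap=2)
print(records[1]["metadata"])

# With a Chroma collection, the metadata could be stored alongside each chunk, e.g.:
# collection.add(ids=[r["id"] for r in records],
#                documents=[r["document"] for r in records],
#                metadatas=[r["metadata"] for r in records],
#                embeddings=[embed_text(r["document"]) for r in records])
```

Stored metadata can later be used to filter results or to trace an answer back to its position in the source text.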

In [None]:
print("Downloading Shakespeare dataset...")

output_path = 'shakespeare.txt' 
url         = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

!curl -L -o {output_path} {url}  

In [None]:
# tinyshakespeare: about 40,000 lines from a variety of Shakespeare's plays
with open(output_path, 'r', encoding='utf-8') as f:
    text = f.read()

# Keep only the first 100,000 characters to keep the demo small
text = text[:100000]

In [None]:
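# Naive fixed-size splitter with overlapping windows
def chunk_text(text, chunk_size, overlap=20):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last chunk already reaches the end of the text
        start = end - overlap  # move back by `overlap` characters
    return chunks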

chunks = chunk_text( text, chunk_size=2000, overlap=200 )
print( f'The split method produced {len(chunks)} chunks' )
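
A fixed-size splitter can cut a line of dialogue in half. As an alternative, here is a minimal sketch of a line-aware splitter that greedily packs whole lines into each chunk. The name `chunk_by_lines` is hypothetical, the sketch produces no overlap, and a single line longer than `chunk_size` is kept as an oversized chunk.

```python
def chunk_by_lines(text, chunk_size):
    """Greedily pack whole lines into chunks of at most `chunk_size` characters."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Start a new chunk if this line would overflow the current one
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
)
line_chunks = chunk_by_lines(sample, chunk_size=40)
for c in line_chunks:
    print(repr(c))
```

Because no characters are dropped or repeated, joining the chunks gives back the original text.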

In [None]:
shakespeare_collection = chroma_client.get_or_create_collection(name="shakespeare-chunks")

for i, chunk in enumerate(tqdm(chunks)):
    embedding = embed_text(chunk)
    shakespeare_collection.add(
        documents=[chunk],
        embeddings=[embedding],
        ids=[str(i)],
    )

## Generate a question set to evaluate retrieval and generation

The idea is simple. By generating each question from a known passage:
- I can keep track of the chunk each question came from.
- I can measure retrieval efficacy by checking whether that chunk is retrieved for its question.

In [None]:
summary_prompt = """You are an expert Shakespeare analyst. You will receive a chunk of one of his plays, and your task is to summarise what is happening in the passage.
Write a short summary capturing the most relevant information of the passage in fewer than 100 words.

Chunk: {chunks}

Answer: 
"""

question_prompt = """You are an expert Shakespeare analyst. You will receive a summary of a passage from one of his plays.
Your task is to generate ONE simple, short, fact-based question that can be answered with the provided text alone. 

You must not mention that you are extracting the question from a text, a passage or a chunk.

Text: {summary}

Question: 
"""

rich.print(f' "SUMMARY_PROMPT" = {summary_prompt}')
rich.print('------------')
rich.print(f' "QUESTION_PROMPT" = {question_prompt}')

In [None]:
def extract_question(chunks, chunk_id, verbose=False):
    """Summarise chunk `chunk_id`, then generate one question from that summary."""

    # Extract a summary from the provided passage
    summary = ollama.chat(model="mistral", messages=[
        {"role": "user", "content": summary_prompt.format(chunks=chunks[chunk_id])}
    ])
    summary = summary["message"]["content"]

    if verbose:
        rich.print(f'Summary from chunk {chunk_id}: {summary}')

    # Extract a question from the generated summary
    question = ollama.chat(model="mistral", messages=[
        {"role": "user", "content": question_prompt.format(summary=summary)}
    ])
    question = question["message"]["content"].strip()
    rich.print(f'Question from chunk {chunk_id}: {question}')

    return question

In [None]:
question = extract_question( chunks, 1, verbose=True )

In [None]:
chunk_ids = [1,20,35,42,48]
questions = []

for chunk_id in tqdm(chunk_ids):
    question = extract_question( chunks, chunk_id )
    questions.append(question)

question_set = pd.DataFrame( {'chunk_id':chunk_ids, 'question': questions} )

In [None]:
# Save the generated questions
question_set.to_csv('rag_question_set.csv', index=False)

In [None]:
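# Alternatively, download a pre-generated question set.
# Note: this overwrites the file saved in the previous cell.
question_set_path = 'rag_question_set.csv'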
question_set_url  = 'https://raw.githubusercontent.com/marc-olm/genai101/main/notebooks/rag_question_set.csv'

!curl -L -o {question_set_path} {question_set_url}

In [None]:
question_set = pd.read_csv(question_set_path)

# Retrieval 

In [None]:
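k = 20  # number of chunks retrieved per question

def find_position(lst, value):
    """Return the 0-based position of `value` in `lst`, or NaN if it is absent."""
    try:
        return lst.index(value)
    except ValueError:
        return np.nan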

In [None]:
found = []
retrieved_chunks = []
for idx in tqdm(question_set.index):
    q   = question_set.at[idx, 'question'] 
    cid = question_set.at[idx, 'chunk_id'] 

    query_embedding = embed_text(q)
    results         = shakespeare_collection.query(query_embeddings=[query_embedding], n_results=k)

    retrieved_chunks.append( results['ids'][0] )
    found.append( find_position( results['ids'][0], str(cid) ) )

question_set['retrieval_rank']   = found 
question_set['retrieved_chunks'] = retrieved_chunks

In [None]:
# Retrieval results: 0-based rank of each question's source chunk (NaN = not in the top k)
question_set[['question', 'retrieval_rank']]

In [None]:
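import matplotlib.pyplot as plt

# retrieval_rank is 0-based, so a hit within the top n means rank < n (NaN counts as a miss).
# With one source chunk per question this is the hit rate (recall) @k, not precision.
ks       = range(1, k + 1)
hit_rate = [ (question_set['retrieval_rank'] < n).mean() for n in ks ]

plt.title( 'Hit rate @k' )
plt.plot( ks, hit_rate )
plt.xlabel( 'k' )
plt.ylabel( 'Hit rate' )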
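
The curve reports, for each cutoff, the share of questions whose source chunk was retrieved. A single-number complement is the mean reciprocal rank (MRR), which also rewards hits that appear near the top. The sketch below uses hypothetical ranks and assumes the same convention as `retrieval_rank`: 0-based positions, with NaN marking a miss.

```python
import math

def hit_rate_at_k(ranks, k):
    """Share of questions whose source chunk appears in the top k (ranks are 0-based)."""
    return sum(1 for r in ranks if not math.isnan(r) and r < k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / (rank + 1); a miss (NaN) contributes 0."""
    return sum(0.0 if math.isnan(r) else 1.0 / (r + 1) for r in ranks) / len(ranks)

# Hypothetical 0-based ranks for five questions; NaN marks a miss
example_ranks = [0, 2, math.nan, 1, 7]

print(hit_rate_at_k(example_ranks, 1))
print(hit_rate_at_k(example_ranks, 3))
print(round(mean_reciprocal_rank(example_ranks), 3))
```

On the notebook's data, the same functions could be applied to `question_set['retrieval_rank'].tolist()`.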

# Augment & Generate

In [None]:
answer_prompt

In [None]:
top_k     = 5
synthesis = []

for idx in tqdm(question_set.index):

    question = question_set.at[idx, 'question']
    context  = "\n".join( [ chunks[int(cid)] for cid in question_set.at[idx, 'retrieved_chunks'][:top_k] ] )

    response = ollama.chat(model="mistral", messages=[
        {"role": "user", "content": answer_prompt.format(context=context, query=question)}
    ])
    answer = response["message"]["content"]
    rich.print( f' "Query": {question} \n "Answer": {answer}' )
    synthesis.append( answer )

question_set['Responses'] = synthesis 
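
Since the prompt asks the model to reply "I cannot help with that" when the context lacks the answer, one rough first check on the generated answers is how often that fallback appears. The sketch below uses hypothetical answers, and `is_refusal` is a hypothetical helper based on simple substring matching, so it will miss paraphrased refusals.

```python
FALLBACK = "I cannot help with that"

def is_refusal(answer):
    """True when the answer contains the fallback sentence requested in the prompt."""
    return FALLBACK.lower() in answer.lower()

# Hypothetical answers, for illustration only
sample_answers = [
    "The citizens are angry about the price of grain.",
    "I cannot help with that.",
    " i cannot help with that",
]
refusals = [is_refusal(a) for a in sample_answers]
refusal_rate = sum(refusals) / len(refusals)
print(refusals, refusal_rate)
```

On the notebook's data, the same check could be run with `question_set['Responses'].apply(is_refusal).mean()`.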