# **Simple RAG Application** | Zero-To-AI
#### By Jashank Kshirsagar
#### **Connect with me on LinkedIn**: [linkedin.com/in/jashank-kshirsagar](https://www.linkedin.com/in/jashank-kshirsagar/)
This notebook shows how to build the simplest Retrieval-Augmented Generation (RAG) system.
It uses TF-IDF (a very basic text retrieval method) and Microsoft's Phi-2 model (a small language model).
Our Goal: Ask a question → find the most relevant text chunks → feed them to Phi-2 → get a grounded answer.

### **Recommended Prerequisites:**  
- To really understand what the following script does, you must first understand what RAG is. 
- If you aren't familiar with Transformer models, I recommend starting with Notebooks 1 and 2 in this series. 
- Come back to this script once you have read the following article:  
  1. https://aws.amazon.com/what-is/retrieval-augmented-generation/

### **Required Library Installs in Terminal:**    
pip install transformers  
pip install torch  
pip install scikit-learn  
pip install numpy  

`re` and `typing` ship with Python's standard library, so they do not need to be installed.

**STEP 1: CREATE THE KNOWLEDGE BASE**  
These are the "documents" we want to use as the knowledge source. In a real project, these could be PDFs, policies, textbooks, etc.   
For simplicity, we will hardcode a small list of strings.

In [None]:

from typing import List, Tuple             # Used to describe the input/output types of functions (helps readability)
import re                                  # Python's "regular expressions" library (for splitting text into sentences)
import numpy as np                         # Numerical library for working with arrays (used in similarity sorting)

# Libraries for retrieval (finding relevant text)
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text into TF-IDF vectors (bag-of-words style)
from sklearn.metrics.pairwise import cosine_similarity        # Measures similarity between vectors (query vs chunks)

# Libraries for generation (using Phi-2 to generate answers)
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline  # Hugging Face tools to load/generate text

# Knowledge Base (our documents)
DOCS: List[str] = [
    "RAG means Retrieval-Augmented Generation. Instead of relying on the model's internal memory, we first retrieve relevant text from our documents, then pass that text to the model to ground the answer.",
    "A minimal RAG has three steps: chunk the documents into small pieces, retrieve the top-k relevant chunks for a question, and generate an answer using the chunks as context.",
    "Chunk size matters. If chunks are too small, you lose context. If they are too big, you add noise. Start with a few sentences per chunk and adjust based on results.",
    "TF-IDF is a simple way to vectorize text for retrieval. It is fast, understandable, and good enough for small demos.",
    "Always show citations for trust. Print which chunks you used so the reader can see the source of the answer."
]
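
If your documents live in files rather than in a hardcoded list, the same `DOCS` list could be built from a folder. The sketch below assumes plain `.txt` files encoded as UTF-8; `load_docs_from_folder` is a hypothetical helper name, and PDFs or other formats would need their own text extraction first.

```python
from pathlib import Path
from typing import List

def load_docs_from_folder(folder: str, pattern: str = "*.txt") -> List[str]:
    """Read every matching text file in `folder` and return the non-empty contents."""
    docs: List[str] = []
    for path in sorted(Path(folder).glob(pattern)):  # sorted → stable document order
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty or whitespace-only
            docs.append(text)
    return docs
```

The result can be assigned to `DOCS` in place of the hardcoded list; the rest of the notebook stays the same.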

**STEP 2: SPLITTING DOCUMENTS INTO CHUNKS**  
Large documents are too big to feed into a model directly. So we break them into smaller pieces ("chunks") that the retriever can search through.

In [None]:
def sentence_split(text: str) -> List[str]:  # Splits a document into individual sentences
    sents = re.split(r'(?<=[.!?])\s+', text.strip())  # Regex: split on whitespace that follows a period, exclamation mark, or question mark
    return [s for s in sents if s]  # Return sentences, removing any empty results

def chunk_doc(text: str, chunk_size: int = 3, overlap: int = 1) -> List[str]:
    sents = sentence_split(text)  # First, split the text into sentences
    chunks = []  # Create an empty list to hold the chunks
    i = 0  # Start from the first sentence
    while i < len(sents):  # Keep going until we reach the end of sentences
        chunk = " ".join(sents[i:i + chunk_size])  # Take "chunk_size" sentences at a time
        if chunk.strip():  # If the chunk isn't empty, keep it
            chunks.append(chunk)
        i += max(1, chunk_size - overlap)  # Move the window forward, leaving some overlap
    return chunks  # Return all chunks created from this text

def build_corpus(docs: List[str], chunk_size: int = 3, overlap: int = 1) -> Tuple[List[str], List[Tuple[int, int]]]:
    corpus, meta = [], []  # corpus = list of all chunks, meta = where each chunk came from
    for d_i, text in enumerate(docs):  # Go through each document
        chs = chunk_doc(text, chunk_size=chunk_size, overlap=overlap)  # Break into chunks
        for c_i, c in enumerate(chs):  # For each chunk...
            corpus.append(c)  # Add the chunk text to the corpus
            meta.append((d_i, c_i))  # Record (document index, chunk index)
    return corpus, meta  # Return both the chunks and the metadata

CORPUS, META = build_corpus(DOCS, chunk_size=3, overlap=1)
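
To see how `chunk_size` and `overlap` interact, it can help to look only at which sentence indices land in each chunk. The sketch below mirrors the stepping logic of `chunk_doc` (step = `chunk_size - overlap`, at least 1); `window_starts` is a hypothetical helper written just for this illustration.

```python
from typing import List

def window_starts(n_sentences: int, chunk_size: int = 3, overlap: int = 1) -> List[range]:
    """Return the sentence-index range covered by each chunk."""
    step = max(1, chunk_size - overlap)  # same step rule as chunk_doc
    return [
        range(i, min(i + chunk_size, n_sentences))
        for i in range(0, n_sentences, step)
    ]

for w in window_starts(5):  # a document with 5 sentences
    print(list(w))
# [0, 1, 2]
# [2, 3, 4]
# [4]
```

Sentence 2 appears in both of the first two chunks, which is the overlap at work. Note that the last chunk can be a short tail that an earlier chunk already covers.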

**STEP 3: CONVERTING CHUNKS TO VECTORS**  
We use TF-IDF for this step.  
- TF-IDF = Term Frequency - Inverse Document Frequency

It gives higher weight to words that are frequent in one document but rare across all documents (here, each chunk counts as one "document").  
Example: "RAG" will get a higher score than "the" or "and".
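
The "rare across all documents" part is the IDF term. As a rough illustration, the sketch below computes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn uses by default. It is a simplified stand-in, not the library code: it splits on whitespace only, and `smoothed_idf` and `toy_docs` are made up for this example.

```python
import math
from typing import Dict, List

def smoothed_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in `docs`."""
    n = len(docs)
    df: Dict[str, int] = {}  # df[term] = number of documents containing the term
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

toy_docs = [
    "the rag pipeline retrieves chunks",
    "the model reads chunks",
    "the answer cites sources",
]
idf = smoothed_idf(toy_docs)
print(round(idf["the"], 3))  # in all 3 docs → 1.0
print(round(idf["rag"], 3))  # in 1 doc     → 1.693
```

The full TF-IDF weight multiplies this value by how often the term occurs in a given chunk, and each chunk vector is then length-normalized.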


In [None]:

VECT = TfidfVectorizer(ngram_range=(1, 2))  # (1,2) means we consider single words and 2-word phrases
CHUNK_VECS = VECT.fit_transform(CORPUS)     # Learn vocabulary + turn each chunk into a vector

**STEP 4: RETRIEVE RELEVANT CHUNKS**  
Given a question, we turn it into a vector, then find the most similar chunk vectors using cosine similarity.

In [None]:
def retrieve(query: str, k: int = 3) -> List[int]:
    q_vec = VECT.transform([query])                 # Convert the query into a TF-IDF vector
    sims = cosine_similarity(q_vec, CHUNK_VECS)[0]  # Compare query with every chunk → get similarity scores
    order = np.argsort(-sims)                       # Sort chunk indices from highest to lowest similarity
    return order[:k].tolist()                       # Return top-k most similar chunk IDs
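
To make the similarity step concrete, here is a dependency-free sketch of cosine similarity and top-k selection on tiny hand-made vectors. The vectors and the helper names `cosine` and `top_k` are invented for illustration; the notebook itself relies on scikit-learn and NumPy for this.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat an all-zero vector as "no similarity"

def top_k(query_vec: List[float], chunk_vecs: List[List[float]], k: int = 2) -> List[int]:
    """Return the indices of the k chunks most similar to the query."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

chunk_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.5, 0.5, 0.0],  # chunk 1
    [0.0, 0.0, 1.0],  # chunk 2 shares nothing with the query
]
query = [1.0, 0.2, 0.0]
print(top_k(query, chunk_vecs, k=2))  # → [0, 1]
```

A chunk with a score of 0 shares no terms with the query. `retrieve` above still returns such chunks when fewer than `k` chunks match, so filtering on a minimum score may be worth considering.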

**STEP 5: LOADING THE LLM**  
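Phi-2 is a small (about 2.7 billion parameters) but capable open language model from Microsoft. It can run on a CPU for small tasks like this one, though generation will be slow, so it remains usable even if you don't have a GPU. The first run downloads the model weights from Hugging Face, which can take a while. 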


In [None]:
GEN_MODEL = "microsoft/phi-2"                                   # Model name from Hugging Face
tok = AutoTokenizer.from_pretrained(GEN_MODEL)                  # Load tokenizer (turns text → tokens → text)
lm = AutoModelForCausalLM.from_pretrained(GEN_MODEL)            # Load model weights into memory
gen = pipeline("text-generation", model=lm, tokenizer=tok)      # Build a simple text generator pipeline
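
A back-of-the-envelope estimate shows why this cell can take a while and use a lot of RAM. The sketch below assumes roughly 2.7 billion parameters and counts only the weights, not activations or other overhead. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
PARAMS = 2.7e9  # approximate parameter count of Phi-2 (assumption for this estimate)

def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(PARAMS, 4))  # 32-bit floats → 10.8
print(weight_memory_gb(PARAMS, 2))  # 16-bit floats → 5.4
```

If the estimate exceeds your available memory, loading the model in a 16-bit format is one option to look into.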

**STEP 6: BUILDING THE PROMPT**  
The difference with RAG: instead of just asking the question, we insert the retrieved chunks into the prompt.  
That way, the model "reads" the context and uses it to give a more grounded and factually accurate answer.

In [None]:
def build_prompt(question: str, top_chunk_ids: List[int], max_chars: int = 1200) -> str:
    ctx_blocks, running = [], 0  # ctx_blocks = the retrieved chunks we’ll insert, running = track prompt size
    for cid in top_chunk_ids:  # For each retrieved chunk ID...
        block = f"[{cid}] {CORPUS[cid]}"  # Label the chunk with its ID for citation
        if running + len(block) > max_chars:  # If adding this makes the prompt too long, stop
            break
        ctx_blocks.append(block)  # Otherwise, keep the chunk
        running += len(block)     # Update the running total of characters
    context = "\n".join(ctx_blocks)  # Join all chunks into one block of text
    prompt = (
        "You are a helpful assistant. Answer the question using ONLY the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return prompt  # Return the final input string for the model
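
To see what the model actually receives, the sketch below renders a prompt from two toy chunks. `format_prompt` is a self-contained variant of `build_prompt` that takes the chunk list as an argument; both it and `toy_chunks` exist only for this illustration.

```python
from typing import List

def format_prompt(question: str, chunks: List[str], ids: List[int], max_chars: int = 1200) -> str:
    """Same layout as build_prompt, but with the chunk list passed in explicitly."""
    blocks, used = [], 0
    for cid in ids:
        block = f"[{cid}] {chunks[cid]}"   # label each chunk with its ID
        if used + len(block) > max_chars:  # stop once the character budget is spent
            break
        blocks.append(block)
        used += len(block)
    context = "\n".join(blocks)
    return (
        "You are a helpful assistant. Answer the question using ONLY the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

toy_chunks = ["RAG retrieves text before generating.", "TF-IDF is a simple retriever."]
print(format_prompt("What is RAG?", toy_chunks, ids=[0, 1]))
```

Because the loop stops at the first chunk that does not fit, a small `max_chars` (for example 50 here) keeps only chunk 0. A budget smaller than the first chunk leaves the context empty.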

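**STEP 7: GENERATE AN ANSWER**  
Finally! Phew. We now chain the pieces together: retrieve the top-k chunks, build the prompt, and let Phi-2 generate. Setting `do_sample=False` makes the model pick the most likely token at each step, so the same question gives the same answer every time.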

In [None]:
def answer(question: str, k: int = 3, max_new_tokens: int = 120) -> Tuple[str, List[int]]:
    top_ids = retrieve(question, k=k)  # Step 1: retrieve top-k chunks
    prompt = build_prompt(question, top_ids)  # Step 2: build prompt with those chunks
    out = gen(prompt, max_new_tokens=max_new_tokens, do_sample=False, return_full_text=False)[0]["generated_text"]  # Step 3: generate text (new tokens only, without the prompt)
    ans = out.strip()  # Clean output: the prompt is already excluded, so just trim whitespace
    return ans, top_ids  # Return both the final answer and the IDs of chunks used
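
A base model such as Phi-2 may keep writing after the answer, for example by inventing a follow-up "Question:" of its own. One possible workaround is to cut the generated text at the first marker that signals a new turn. The sketch below assumes the markers match the prompt layout used in this notebook; `trim_at_stop` is a hypothetical helper, not a Hugging Face function.

```python
from typing import Sequence

def trim_at_stop(text: str, stops: Sequence[str] = ("\nQuestion:", "\nContext:")) -> str:
    """Keep only the text before the earliest stop marker (if any appears)."""
    cut = len(text)
    for s in stops:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)  # remember the earliest marker position
    return text[:cut].strip()

raw = " RAG retrieves text first.\nQuestion: What is TF-IDF?\nAnswer: A retriever."
print(trim_at_stop(raw))  # → RAG retrieves text first.
```

Applied to `ans` inside `answer`, this would drop any self-invented follow-up turns while leaving clean answers untouched.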

**NOW GIVE IT A TRY**  

In [None]:
question = "What is RAG and why do we chunk documents?"  # Replace this with a question of your choice

if __name__ == "__main__":
    q = question  # Example question
    a, used = answer(q, k=3, max_new_tokens=120)      # Run the RAG pipeline
    print("Q:", q)                                   # Print the question
    print("\nA:", a)                                 # Print the model's answer
    print("\nCitations:")                             # Print citations for transparency
    for cid in used:                                 # For each used chunk...
        doc_id, chunk_id = META[cid]                 # Look up which document + chunk it came from
        snippet = CORPUS[cid][:120].replace("\n", " ")  # Show only the first 120 chars as a preview
        print(f" - [chunk {cid} | doc {doc_id} | part {chunk_id}] :: {snippet}...")

✅ You have now built your very first **Retrieval-Augmented Generation (RAG) system** from scratch.  
- You learned how to split documents into chunks.  
- You used TF-IDF to retrieve relevant chunks for a query.  
- You built a simple prompt and generated an answer with **Phi-2**.  
- You also saw how to display **citations** so users can trust the source of the answer.  