Authored by: Aryan Mistry

# Minimal Retrieval-Augmented Generation (RAG) Pipeline

Large language models are excellent at pattern matching but they can **hallucinate** by making up facts that aren't grounded in reality. One way to reduce hallucinations is to **augment the model with retrieval**: instead of relying solely on its internal parameters, the model retrieves relevant passages from a knowledge base and uses them to inform its answer.

This notebook walks you through a minimal Retrieval-Augmented Generation (RAG) pipeline. We'll start with a tiny corpus, break it into chunks, embed those chunks so we can compare them to a question, retrieve the most relevant pieces, and finally produce an answer using a simple stand-in for the language model rather than a real one. Along the way we'll highlight why each step matters and how libraries like LangChain can help you build production systems. [9]

## 1. Sample Corpus

We'll work with a small corpus of short articles about various topics. In a real
application you would index thousands of documents, but this toy corpus is
sufficient to illustrate the retrieval process.

Run the cell below to define the corpus and inspect a few entries. Each string
represents a document.


In [1]:

# Define a small corpus of documents
corpus = [
    "The Apollo 11 mission in 1969 marked the first time humans walked on the Moon. "
    "Neil Armstrong and Buzz Aldrin spent hours exploring the lunar surface while Michael Collins orbited above.",

    "Global climate change is primarily driven by human activities such as burning fossil fuels. "
    "Rising concentrations of greenhouse gases are warming the atmosphere and oceans, leading to melting ice caps.",

    "Artificial intelligence (AI) refers to computer systems that can perform tasks normally requiring human intelligence. "
    "Machine learning is a subset of AI in which algorithms improve through experience.",

    "The global economy is interconnected. Events in one country, like interest rate changes or trade policies, "
    "can ripple through financial markets around the world.",

    "Photosynthesis is the process by which green plants convert sunlight into chemical energy. "
    "Chlorophyll in plant cells captures light energy to fuel this conversion."
]

# Preview the corpus
for i, doc in enumerate(corpus):
    print(f"Document {i}: {doc[:100]}...")


Document 0: The Apollo 11 mission in 1969 marked the first time humans walked on the Moon. Neil Armstrong and Bu...
Document 1: Global climate change is primarily driven by human activities such as burning fossil fuels. Rising c...
Document 2: Artificial intelligence (AI) refers to computer systems that can perform tasks normally requiring hu...
Document 3: The global economy is interconnected. Events in one country, like interest rate changes or trade pol...
Document 4: Photosynthesis is the process by which green plants convert sunlight into chemical energy. Chlorophy...


## 2. Chunking

Language models have a limited context window: they can't ingest arbitrarily
long documents at once. To make retrieval efficient and to ensure relevant
snippets are available, we split each document into *chunks*. Here we'll
implement a simple chunker that splits on full stops and groups sentences
into pairs. In a production system you might use more sophisticated methods
(e.g. sentence tokenisers, fixed token counts or overlap between chunks).

Feel free to experiment with the `max_sentences` parameter in the exercises
below. [9]


In [2]:

import re

# Split a document into chunks containing up to `max_sentences` sentences.
# This simple splitter treats every period not followed by a digit as a
# sentence boundary and groups consecutive sentences together.
def chunk_document(doc: str, max_sentences: int = 2) -> list:
    """Split a document into chunks of up to `max_sentences` sentences.

    This simplistic splitter uses periods to identify sentence boundaries, so abbreviations such as "e.g." are split incorrectly, and the final period of each chunk is dropped. In practice you might use NLP libraries (e.g. spaCy) for more robust splitting.
    """
    sentences = [s.strip() for s in re.split(r"\.(?!\d)", doc) if s.strip()]
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = '. '.join(sentences[i:i + max_sentences]).strip()
        chunks.append(chunk)
    return chunks

# Build the list of chunks along with their document IDs
chunks = []
for doc_id, doc in enumerate(corpus):
    for chunk in chunk_document(doc, max_sentences=2):
        chunks.append((doc_id, chunk))

print(f"Created {len(chunks)} chunks from {len(corpus)} documents.")
for i, (doc_id, chunk) in enumerate(chunks[:5]):
    print(f"Chunk {i} (doc {doc_id}): {chunk[:80]}...")


Created 5 chunks from 5 documents.
Chunk 0 (doc 0): The Apollo 11 mission in 1969 marked the first time humans walked on the Moon. N...
Chunk 1 (doc 1): Global climate change is primarily driven by human activities such as burning fo...
Chunk 2 (doc 2): Artificial intelligence (AI) refers to computer systems that can perform tasks n...
Chunk 3 (doc 3): The global economy is interconnected. Events in one country, like interest rate ...
Chunk 4 (doc 4): Photosynthesis is the process by which green plants convert sunlight into chemic...
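
The section above mentions overlap between chunks as one of the more sophisticated options. A minimal sketch of that idea follows, reusing the same naive period-based splitter; the function name `chunk_with_overlap` and its `overlap` parameter are assumptions made for illustration and are not part of the notebook.

```python
import re

def chunk_with_overlap(doc: str, max_sentences: int = 2, overlap: int = 1) -> list:
    """Split `doc` into chunks of up to `max_sentences` sentences, repeating
    the last `overlap` sentences of each chunk at the start of the next one."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must satisfy 0 <= overlap < max_sentences")
    # Same naive splitter as chunk_document: a period not followed by a digit
    sentences = [s.strip() for s in re.split(r"\.(?!\d)", doc) if s.strip()]
    step = max_sentences - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(". ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # this window already reached the end of the document
    return chunks

demo = "A is first. B is second. C is third. D is fourth."
for chunk in chunk_with_overlap(demo, max_sentences=2, overlap=1):
    print(chunk)
```

With `overlap=1` each sentence boundary is covered by two chunks, which can help when the answer to a question straddles two sentences, at the cost of indexing more text.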


## 3. Embedding with TF-IDF

To compare text snippets numerically we need vector representations. A simple
and widely used technique is **TF-IDF** (term frequency–inverse document
frequency). Each chunk is represented by a vector with one entry per
vocabulary term: the term's count in the chunk, scaled up when the term
appears in few chunks and down when it appears in many.

While modern RAG systems use dense neural embeddings (e.g. from BERT or
OpenAI models) to capture semantic similarity, TF-IDF is easy to compute
and works surprisingly well on small corpora.

We'll build a vocabulary, compute document frequencies, and then create a
TF-IDF vector for each chunk. [7]
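
The weighting used by `compute_tfidf` in the next cell is a smoothed variant of IDF. Written out, with $N$ the number of chunks, $\mathrm{df}(t)$ the number of chunks containing term $t$, and $\mathrm{tf}(t, c)$ the raw count of $t$ in chunk $c$ (this notation is introduced here for illustration and does not appear in the notebook's code):

$$
\mathrm{idf}(t) = \ln\frac{N + 1}{\mathrm{df}(t) + 1} + 1,
\qquad
\mathrm{tfidf}(t, c) = \mathrm{tf}(t, c)\cdot\mathrm{idf}(t)
$$

For the five chunks used here, a term found in a single chunk gets $\mathrm{idf} = \ln(6/2) + 1 \approx 2.10$, while a term found in all five gets $\mathrm{idf} = \ln(6/6) + 1 = 1$. The added ones inside the logarithm keep the weight finite for a term with $\mathrm{df}(t) = 0$, and the trailing $+1$ means even a term present in every chunk keeps a non-zero weight.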


In [3]:

import math
from collections import Counter, defaultdict

# Build a vocabulary and document frequency dictionary
def build_vocabulary(chunks):
    """Construct the vocabulary and document frequency counts for a list of chunks.

    Tokens are lowercased and split on whitespace, so punctuation stays attached to words.
    Returns a sorted list of all unique terms and a dictionary mapping each term to the number of chunks it appears in.
    """
    df = defaultdict(int)
    vocab = set()
    for _, chunk in chunks:
        words = chunk.lower().split()
        unique_words = set(words)
        for word in unique_words:
            df[word] += 1
        vocab.update(words)
    return sorted(vocab), df

vocab, df = build_vocabulary(chunks)

# Compute a TF-IDF vector for a single chunk
def compute_tfidf(chunk: str, vocab: list, df: dict, n_docs: int) -> list:
    """Compute a TF-IDF vector for a given chunk.

    The TF weights are raw counts and the smoothed IDF uses `df` to down-weight common words. Words that are not in `vocab` are ignored.
    """
    words = chunk.lower().split()
    tf = Counter(words)
    vec = []
    for term in vocab:
        tf_val = tf.get(term, 0)
        idf = math.log((n_docs + 1) / (df.get(term, 0) + 1)) + 1
        vec.append(tf_val * idf)
    return vec

# Precompute all chunk vectors
n_chunks = len(chunks)
chunk_vectors = [compute_tfidf(text, vocab, df, n_chunks) for _, text in chunks]
print(f"Computed TF-IDF vectors for {n_chunks} chunks and {len(vocab)} terms.")


Computed TF-IDF vectors for 5 chunks and 114 terms.
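
Both `build_vocabulary` and `compute_tfidf` tokenise with `str.split()`, so punctuation stays attached to words: `energy.`, `energy` and `energy?` count as three different terms. The sketch below shows one possible alternative, a regex tokeniser that keeps only runs of letters and digits. The helper name `tokenize` is an assumption for illustration; swapping it into the notebook would change the vocabulary size and the similarity scores reported in the outputs.

```python
import re

def tokenize(text: str) -> list:
    """Lowercase `text` and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "How do plants convert sunlight into energy?"
print(sample.lower().split()[-1])  # whitespace split keeps the question mark: energy?
print(tokenize(sample)[-1])        # regex tokens drop it: energy
```

This tokeniser is still crude: it splits contractions such as "aren't" into two tokens and discards non-ASCII letters, so a library tokeniser may be a better fit for real text.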


## 4. Retrieval with Cosine Similarity

With vector representations in hand, the next step is to find which chunks
are most relevant to a user's query. We'll convert the query into a TF-IDF
vector (using the same vocabulary and IDF values) and compute the **cosine
similarity** between the query and each chunk. Cosine similarity measures the
angle between two vectors: in general it ranges from -1 (opposite directions)
to 1 (same direction), but TF-IDF vectors have no negative entries, so the
scores here fall between 0 and 1. We then pick the top-`k` chunks with the
highest similarity.

Feel free to adjust `top_k` to see how the retrieved context changes. [9]


In [4]:

# Compute cosine similarity between two vectors
def cosine_similarity(vec1: list, vec2: list) -> float:
    """Compute cosine similarity between two numeric vectors.

    For non-negative vectors such as TF-IDF the result lies between 0 and 1. A small constant in the denominator avoids division by zero for all-zero vectors.
    """
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2 + 1e-8)

# Search for top-k similar chunks
def search(query: str, chunks, chunk_vectors, vocab, df, n_docs, top_k: int = 2):
    """Return the indices and similarity scores of the `top_k` chunks most similar to `query`."""
    q_vec = compute_tfidf(query, vocab, df, n_docs)
    sims = [cosine_similarity(q_vec, c_vec) for c_vec in chunk_vectors]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return ranked[:top_k], [sims[i] for i in ranked[:top_k]]

# Test retrieval on a sample query
query = "How do plants convert sunlight into energy?"
top_indices, similarities = search(query, chunks, chunk_vectors, vocab, df, n_chunks, top_k=3)
print(f"Query: {query}")
for rank, (idx, score) in enumerate(zip(top_indices, similarities), 1):
    doc_id, text = chunks[idx]
    print(f"Rank {rank} (score={score:.3f}): Document {doc_id}, chunk: {text}")


Query: How do plants convert sunlight into energy?
Rank 1 (score=0.440): Document 4, chunk: Photosynthesis is the process by which green plants convert sunlight into chemical energy. Chlorophyll in plant cells captures light energy to fuel this conversion
Rank 2 (score=0.000): Document 0, chunk: The Apollo 11 mission in 1969 marked the first time humans walked on the Moon. Neil Armstrong and Buzz Aldrin spent hours exploring the lunar surface while Michael Collins orbited above
Rank 3 (score=0.000): Document 1, chunk: Global climate change is primarily driven by human activities such as burning fossil fuels. Rising concentrations of greenhouse gases are warming the atmosphere and oceans, leading to melting ice caps
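
The 0.000 scores at ranks 2 and 3 are consistent with how cosine similarity behaves: a chunk that shares no terms with the query has a zero dot product with it, whatever the weights. The small sketch below illustrates this on a hypothetical four-term vocabulary; the vectors are made-up numbers, not values from the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity with the same small epsilon guard used in the notebook."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

# Hypothetical non-negative weights over a 4-term vocabulary
query_vec = [1.0, 2.0, 0.0, 0.0]
same_direction = [2.0, 4.0, 0.0, 0.0]  # same terms, same proportions
one_shared_term = [1.0, 0.0, 3.0, 0.0]
no_shared_terms = [0.0, 0.0, 5.0, 1.0]

for name, vec in [
    ("same_direction", same_direction),
    ("one_shared_term", one_shared_term),
    ("no_shared_terms", no_shared_terms),
]:
    print(f"{name}: {cosine(query_vec, vec):.3f}")
# same_direction: 1.000
# one_shared_term: 0.141
# no_shared_terms: 0.000
```

Note that scaling a vector does not change its score, which is why a long chunk is not favoured simply for repeating the same terms in the same proportions.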


## 5. Generation

Once you've retrieved relevant context, a language model would normally read
both the query and the context and produce a natural language answer. In our
minimal example we'll simulate this by simply echoing the most relevant chunk
back to the user. This may feel unsatisfying, but it highlights the role of
retrieval: it surfaces the raw information on which a model could base its
answer. [9]


In [5]:

# Generate an answer from the top retrieved context
def generate_answer(query: str, top_indices: list, chunks: list) -> str:
    """Generate a naive answer by echoing the top-ranked retrieved chunk.

    In a real RAG system this function would call a language model, passing the query and retrieved context to generate a concise answer.
    """
    if not top_indices:
        return "I'm sorry, I couldn't find any relevant information to answer your question."
    doc_id, context = chunks[top_indices[0]]
    return f"Based on document {doc_id}, here's some relevant information: {context}"

answer = generate_answer(query, top_indices, chunks)
print(answer)


Based on document 4, here's some relevant information: Photosynthesis is the process by which green plants convert sunlight into chemical energy. Chlorophyll in plant cells captures light energy to fuel this conversion
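
In a real pipeline, the retrieved chunks would typically be packed into a prompt before being sent to a language model. The sketch below shows one way such a prompt might be assembled; the function name `build_prompt` and the wording of the template are assumptions for illustration, and no model is actually called.

```python
def build_prompt(query: str, contexts: list) -> str:
    """Assemble a grounded prompt from retrieved chunk texts (hypothetical template)."""
    # Number the passages so an answer could refer back to them
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, 1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\nAnswer:"
    )

demo_contexts = [
    "Photosynthesis is the process by which green plants convert sunlight into chemical energy",
    "Chlorophyll in plant cells captures light energy to fuel this conversion",
]
print(build_prompt("How do plants convert sunlight into energy?", demo_contexts))
```

Within this notebook the contexts could be taken from the retrieval step, for example `[chunks[i][1] for i in top_indices]`.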


## 6. Towards Production: Using Libraries like LangChain

Our implementation above exposes each part of the RAG pipeline. Frameworks
like **LangChain**, **LlamaIndex** and others provide higher-level abstractions
that handle chunking, embedding, retrieval and calling the language model
for you. Here's what using a RAG pipeline might look like with such a
framework (example code only; this cell is not executable here):

```python
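# Import paths follow the legacy `langchain` package layout; newer releases
# move several of these into `langchain_community` and `langchain_openai`.
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# Load every .txt file in the folder (TextLoader on its own reads a single file)
loader = DirectoryLoader('my_docs', glob='*.txt', loader_cls=TextLoader)
raw_docs = loader.load()

# Split the documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_documents(raw_docs)

# Create an embedding model and vector store
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(docs, embeddings)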

# Build a retrieval-augmented QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)

# Ask questions
response = qa_chain.run('How do plants convert sunlight into energy?')
print(response)
```

These libraries take care of details like pagination, token limits and
embedding storage. Understanding the internals, however, helps you debug and
customise these systems. [11]


## 7. Exercises

To deepen your understanding, try the following exercises:

1. **Vary the chunk size.** Change `max_sentences` in the chunking function to 1 or 3 and observe how retrieval results differ. Does a smaller or larger chunk size improve relevance?
2. **Top-`k` tuning.** Adjust `top_k` in the `search` function. What happens when you retrieve more than one chunk? How could you combine information from multiple chunks?
3. **Alternate similarity measures.** Modify `cosine_similarity` to use Euclidean distance or Jaccard similarity. Which works better on this corpus?
4. **Extend generation.** Instead of simply returning the top chunk, write a function that concatenates the top-`k` chunks and summarises them in your own words.
5. **Use a different corpus.** Replace the sample corpus with your own documents (e.g. lecture notes or articles) and build a mini RAG system for a new domain.

Feel free to add cells below this one for your solutions. Use Markdown to describe your thought process and code cells to implement your answers.


6. **Implement Jaccard similarity retrieval.** Instead of TF‑IDF and cosine similarity, compute the Jaccard similarity between the set of words in the query and each chunk. Compare the retrieval results.

HINT: Computing Jaccard Similarity

To measure how similar two sets of words (e.g. query tokens and document tokens) are,  
you can use the **Jaccard similarity** formula:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Where:

* $A$ = set of unique tokens in text A
* $B$ = set of unique tokens in text B
* $A \cap B$ = tokens present in **both** A and B
* $A \cup B$ = tokens present in **either** A or B

In Python terms:

```python
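# set_a and set_b are the sets of unique tokens from the two texts
intersection = len(set_a & set_b)
union = len(set_a | set_b)
similarity = intersection / union if union else 0.0  # avoid dividing by zero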
```

7. **Use scikit‑learn's TF‑IDF.** Explore using `sklearn.feature_extraction.text.TfidfVectorizer` to compute embeddings. How do the results differ from our manual implementation?
8. **Expand the corpus.** Add more documents to the corpus and observe how the retrieval quality changes.
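
Returning to the Jaccard hint for exercise 6, here is a small worked example of the formula on two short strings. It is a sketch only: the helper name `jaccard` is an assumption, it uses the same whitespace tokenisation as the rest of the notebook, and ranking the chunks with it is left as the exercise.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity between the sets of lowercase whitespace tokens."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts: define the similarity as 0
    return len(set_a & set_b) / len(union)

# Shared tokens: {plants, convert}; union has 5 tokens, so J = 2 / 5
print(jaccard("plants convert sunlight", "green plants convert light"))  # 0.4
```

Unlike TF-IDF with cosine similarity, this measure ignores how often a word occurs and how rare it is across the corpus, so common words count as much as distinctive ones.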

## References

Foundational LLMs & Transformers

1. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems (NIPS 2017).
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
4. OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.
5. Touvron, H., et al. (2023). LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.

Generative AI & Sampling

6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
7. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
8. Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, University of Toronto.

Retrieval-Augmented Generation (RAG) & Knowledge Grounding

9. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
10. deepset ai (2023). Haystack: Open-Source Framework for Search and RAG Applications. https://haystack.deepset.ai
11. LangChain (2023). LangChain Documentation and Cookbook. https://python.langchain.com

Evaluation & Safety

12. Papineni, K., et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL 2002.
13. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop 2004.
14. OpenAI (2024). Evaluating Model Outputs: Faithfulness and Grounding. OpenAI Docs.
15. Guardrails AI (2024). Open-Source Guardrails Framework. https://github.com/shreyar/guardrails

Prompt Engineering & Instruction Tuning

16. White, J. (2023). The Prompting Guide. https://www.promptingguide.ai
17. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.

Agents & Tool Use

18. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
19. LangChain (2024). LangChain Agents and Tools Documentation.
20. Microsoft (2023). Semantic Kernel Developer Guide. https://learn.microsoft.com/en-us/semantic-kernel/
21. Google DeepMind (2024). Gemini Technical Report. arXiv:2312.11805.

State, Memory & Orchestration

22. LangGraph (2024). Stateful Agent Orchestration Framework. https://langchain-langgraph.vercel.app
23. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442.

Pedagogical and Course Design References

24. fast.ai (2023). fast.ai Deep Learning Course Notebooks. https://course.fast.ai
25. Ng, A. (2023). DeepLearning.AI Short Courses on Generative AI.
26. MIT 6.S191, Stanford CS324, UC Berkeley CS294-158. (2022–2024). Course Materials and Public Notebooks for ML and LLMs.