# 📓 The GenAI Revolution Cookbook

**Title:** 25 RAG Architectures Explained: Pick the Right One for Your App

**Description:** Quickly choose the right RAG architecture for your use case: reduce hallucinations, cut latency, and ship with project-ready examples today.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



Retrieval-Augmented Generation (RAG) grounds LLM responses in external knowledge. This guide walks through a minimal, runnable baseline RAG pipeline built with LangChain, then extends it with corrective, speculative, and agentic patterns, all executable in Google Colab against a single dataset.

## Why Use LangChain for RAG

LangChain provides a unified interface for building RAG pipelines, abstracting retriever setup, prompt templating, and chain orchestration. For AI Builders, this means faster iteration on retrieval strategies, easier integration of multiple LLMs, and modular components that can be swapped or extended without rewriting core logic.

## Core Concepts for This Use Case

- **Retriever**: Fetches relevant document chunks from a vector store based on semantic similarity.
- **Chain**: Orchestrates retrieval and generation, passing context to the LLM.
- **Prompt Template**: Structures the input to the LLM, ensuring answers are grounded in retrieved context.
- **Evaluation**: Measures relevance, groundedness, and latency to compare RAG variants.
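To make these roles concrete without calling any API, here is a toy sketch of the retrieve-then-prompt flow. It is an illustration only: `embed` uses word counts as a stand-in for a real embedding model, and `embed`, `cosine`, `retrieve`, `build_prompt`, and `toy_chunks` are hypothetical names, not LangChain APIs.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=1):
    """Retriever role: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Prompt-template role: instruct the model to answer from the context only."""
    context = "\n".join(context_chunks)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

# Chain role: retrieval feeds the prompt, which would then be sent to an LLM.
toy_chunks = [
    "Reset your password via Settings > Security.",
    "For MFA issues, contact support@acme.com.",
]
toy_question = "How do I reset my password?"
top_chunks = retrieve(toy_question, toy_chunks, k=1)
toy_prompt = build_prompt(toy_question, top_chunks)
print(toy_prompt)
```

The password chunk ranks first because it shares the words "reset" and "password" with the question, while the MFA chunk shares none. A real embedding model matches on meaning rather than exact words, but the ranking step works the same way.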

## Setup

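Install the required packages and set your OpenAI API key. This guide uses a minimal dependency set for a single-tool tutorial. The examples rely on LangChain's legacy chain and agent helpers (`RetrievalQA`, `LLMChain`, `initialize_agent`), which newer releases deprecate or relocate; if an import fails, pin an older `langchain` version.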

In [None]:
!pip install -qU langchain langchain-openai langchain-community chromadb

Set your OpenAI API key securely. If running in Colab, use the snippet below to prompt for the key at runtime.

In [None]:
import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Verify that the API key is set before proceeding.

In [None]:
import os

if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError(
        "Missing OPENAI_API_KEY. Please set it in the environment or use the cell above."
    )

print("API key is set. Ready to proceed.")

## Using LangChain for RAG in Practice

### Baseline RAG: Fast, Simple Retrieval for FAQs

Start with a minimal RAG pipeline. This example uses a small internal knowledge base for "Acme" and retrieves relevant chunks to answer user questions.

Prepare a minimal document and split it into chunks for retrieval.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.documents import Document

# A one-document knowledge base for "Acme".
docs = [Document(page_content="""
Acme Docs: Reset your password via Settings > Security. For MFA issues, contact support@acme.com.
""")]

# Split into overlapping chunks; this short document fits in a single chunk.
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed the chunks and index them in an in-memory Chroma collection.
embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(chunks, embedding=embeddings)
# k is an upper bound: with fewer than 3 chunks indexed, fewer are returned.
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

# temperature=0 keeps answers as repeatable as possible across runs.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

print("Baseline RAG setup complete.")

Run a sample query to verify the pipeline works.

In [None]:
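# RetrievalQA takes its input under "query" and returns the answer under "result".
print(qa.invoke({"query": "How do I reset my password?"})["result"])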
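The `chunk_size` and `chunk_overlap` settings in the baseline are easier to reason about with a stripped-down splitter. The sketch below is a plain fixed-width sliding window, not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph breaks and spaces. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=50):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces

# Small numbers make the overlap visible: each chunk repeats the previous chunk's last character.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']

acme_text = (
    "Acme Docs: Reset your password via Settings > Security. "
    "For MFA issues, contact support@acme.com."
)
print(len(sliding_chunks(acme_text)))
# → 1
```

The Acme document is shorter than 300 characters, so it yields a single chunk. Retrieval quality on a corpus this small says little about how a chunking configuration behaves on longer documents.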

### Corrective RAG: Two-Pass Generation for Higher Precision

Corrective RAG adds a verification step to catch and correct unsupported claims in the initial answer.

Generate an initial answer, then verify and correct it using a second LLM pass.

In [None]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

verify_prompt = PromptTemplate.from_template(
    "Given the question and answer, identify unsupported claims and correct them "
    "using ONLY the context.\nContext:\n{context}\nQ: {q}\nA: {a}\nRevised answer:"
)

question = "Explain MFA recovery steps."

# Pass 1: draft an answer with the baseline chain.
initial = qa.invoke({"query": question})["result"]
# Fetch the supporting chunks so the verifier can check the draft against them.
context = retriever.invoke(question)

# Pass 2: ask the LLM to revise the draft using only the retrieved context.
verify = LLMChain(llm=llm, prompt=verify_prompt)
corrected = verify.invoke({
    "context": "\n".join(d.page_content for d in context),
    "q": question,
    "a": initial,
})["text"]

print(corrected)
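The control flow of the two-pass pattern can be sketched independently of any LLM. In the example below, the retriever and both model calls are replaced by stub functions (`stub_retrieve`, `stub_generate`, and `stub_verify` are hypothetical, as is `corrective_answer`), and the "verifier" is a crude verbatim check standing in for the second LLM pass.

```python
def corrective_answer(question, retrieve_fn, generate_fn, verify_fn):
    """Two-pass flow: draft an answer, then revise it against the same context."""
    context = "\n".join(retrieve_fn(question))  # retrieve once, reuse for both passes
    draft = generate_fn(question, context)
    revised = verify_fn(question, context, draft)
    return {"draft": draft, "revised": revised}

# Stubs standing in for the retriever and the two LLM calls (hypothetical).
def stub_retrieve(question):
    return ["For MFA issues, contact support@acme.com."]

def stub_generate(question, context):
    # The second line is deliberately unsupported by the context.
    return "For MFA issues, contact support@acme.com.\nYou can also reset MFA by phone."

def stub_verify(question, context, draft):
    # Crude stand-in for the verifier: keep only lines found verbatim in the context.
    return "\n".join(line for line in draft.splitlines() if line in context)

two_pass = corrective_answer(
    "Explain MFA recovery steps.", stub_retrieve, stub_generate, stub_verify
)
print(two_pass["revised"])
# → For MFA issues, contact support@acme.com.
```

Retrieving once and passing the same context to both passes ensures the verifier checks the draft against exactly what the generator saw. With `RetrievalQA`, passing `return_source_documents=True` to `from_chain_type` is one way to get the chunks the chain used without a second retrieval call.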

### Speculative RAG: Draft with Fast Model, Verify with Strong Model

Speculative RAG uses a smaller, faster LLM for the initial draft, then refines it with a stronger model to improve factual accuracy.

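Generate a draft answer with a fast model, then verify and improve it with the baseline model. In this notebook both roles use `gpt-4o-mini`, so the cell demonstrates the flow rather than a speed or accuracy gain; to see the trade-off, configure the verifying chain with a stronger model.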

In [None]:
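# Draft model; identical to the baseline `llm` here, so this shows control flow only.
fast_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Step 1: draft with the fast model.
draft_qa = RetrievalQA.from_chain_type(fast_llm, retriever=retriever)
draft = draft_qa.invoke({"query": "Summarize MFA steps."})["result"]
# Step 2: refine the draft with the baseline chain, which retrieves context again.
final = qa.invoke({"query": f"Improve factual accuracy of: {draft}. Keep it concise."})["result"]

print(final)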
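A variation on this flow has the fast model write several drafts, each from a different subset of the retrieved chunks, and lets the stronger model score the drafts and keep the best one. The sketch below shows only that selection logic under stated assumptions: `draft_fn` and `score_fn` are stubs for the two models, the word-overlap score is a placeholder for a real verifier, and all names are hypothetical.

```python
import re

def speculative_select(question, chunks, draft_fn, score_fn, subset_size=1):
    """Draft one answer per chunk subset, score each draft, and return the best."""
    if not chunks:
        raise ValueError("need at least one chunk to draft from")
    subsets = [chunks[i:i + subset_size] for i in range(0, len(chunks), subset_size)]
    candidates = []
    for subset in subsets:
        candidate = draft_fn(question, subset)          # fast model in a real system
        score = score_fn(question, subset, candidate)   # strong model in a real system
        candidates.append((score, candidate))
    best_score, best_draft = max(candidates, key=lambda pair: pair[0])
    return best_draft, best_score

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def stub_draft(question, subset):
    # Placeholder drafter: echo the chunks it was given.
    return " ".join(subset)

def stub_score(question, subset, candidate):
    # Placeholder verifier: count the words shared by the question and the draft.
    return len(words(question) & words(candidate))

sample_chunks = [
    "Reset your password via Settings > Security.",
    "For MFA issues, contact support@acme.com.",
]
best_draft, best_score = speculative_select(
    "Who do I contact about MFA issues?", sample_chunks, stub_draft, stub_score
)
print(best_score, best_draft)
# → 3 For MFA issues, contact support@acme.com.
```

Because each draft is produced independently, the drafting calls can run in parallel, which is where the latency benefit of a fast drafter would come from.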

### Agentic RAG: Use an Agent to Plan, Retrieve, and Act

Agentic RAG gives the LLM access to tools, allowing it to plan multi-step queries and cite sources.

Define a tool for document search and initialize a zero-shot agent.

In [None]:
from langchain.agents import initialize_agent, Tool, AgentType

def search_docs(q):
    """Search internal docs and return the matching chunks as one string."""
    return "\n".join(d.page_content for d in retriever.invoke(q))

# Expose the retriever to the agent as a single named tool.
tools = [Tool(name="doc_search", func=search_docs, description="Search internal docs")]
# Zero-shot ReAct: the agent chooses tools based only on their descriptions.
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False)

print(agent.invoke({"input": "Create a step-by-step policy summary; cite documents."})["output"])
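The agent is asked to cite documents, but `search_docs` returns raw text with no source labels, so the model has nothing concrete to cite. One option is to have the tool number each chunk and attach its origin. The sketch below assumes each chunk carries a source label; the `Chunk` class, the `format_with_citations` helper, and the file name `acme-docs.md` are all hypothetical, and the notebook's `Document` would need that label stored in its `metadata` for this to apply.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A retrieved passage plus the (hypothetical) source it came from."""
    text: str
    source: str

def format_with_citations(chunks):
    """Render chunks as numbered, source-labelled lines the agent can cite as [n]."""
    return "\n".join(
        f"[{i}] ({chunk.source}) {chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )

labelled = [
    Chunk("Reset your password via Settings > Security.", "acme-docs.md"),
    Chunk("For MFA issues, contact support@acme.com.", "acme-docs.md"),
]
print(format_with_citations(labelled))
# → [1] (acme-docs.md) Reset your password via Settings > Security.
# → [2] (acme-docs.md) For MFA issues, contact support@acme.com.
```

Returning text in this shape from the tool gives the agent stable markers to reference, and makes it possible to check afterwards that every cited number maps to a chunk that was actually retrieved.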

## Run and Evaluate

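Measure latency and answer quality with a consistent test query so the variants are directly comparable. The timing loop covers the baseline, corrective, and speculative variants; the agentic variant is not timed here. Each variant is timed once, so treat the numbers as rough, since API response times vary between calls.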

Time each variant and print the duration.

In [None]:
import time

def time_call(fn, *args, **kwargs):
    """Return (output, elapsed seconds) for a single call to fn."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    output = fn(*args, **kwargs)
    duration = time.perf_counter() - start
    return output, duration

test_query = "How do I reset my password?"

# Build the draft chain once so chain construction is not counted as latency.
draft_qa = RetrievalQA.from_chain_type(fast_llm, retriever=retriever)

def run_baseline(q):
    return qa.invoke({"query": q})["result"]

def run_corrective(q):
    context = "\n".join(d.page_content for d in retriever.invoke(q))
    return verify.invoke({"context": context, "q": q, "a": run_baseline(q)})["text"]

def run_speculative(q):
    draft = draft_qa.invoke({"query": q})["result"]
    return run_baseline(f"Improve factual accuracy of: {draft}. Keep it concise.")

variants = [
    ("Baseline RAG", run_baseline),
    ("Corrective RAG", run_corrective),
    ("Speculative RAG", run_speculative),
]

for name, fn in variants:
    output, duration = time_call(fn, test_query)
    print(f"{name}: {duration:.2f}s")
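A single timed call is noisy, especially when each call crosses the network. A small extension, sketched below, repeats the call and reports the median together with the spread. The `benchmark` helper is hypothetical, and the workload shown is a local stand-in so the example runs without an API key.

```python
import statistics
import time

def benchmark(fn, *args, repeats=5, **kwargs):
    """Call fn `repeats` times and summarize the elapsed time of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }

# Local stand-in workload; in the notebook this would be one of the RAG variants.
stats = benchmark(lambda n: sum(range(n)), 100_000, repeats=5)
print({name: round(value, 4) for name, value in stats.items()})
```

The median is less sensitive than the mean to one slow outlier, such as a cold connection on the first request. Keep in mind that repeating calls against a paid API multiplies cost by `repeats`.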

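Evaluate answer relevance and groundedness using an LLM-as-judge. The cell below scores the baseline answer; pass another variant's output as the answer to score it the same way.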

In [None]:
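def judge_answer(q, ctx, a):
    """Ask the LLM to rate an answer's relevance and groundedness (0-5 each)."""
    prompt = (
        f"Question: {q}\nContext: {ctx}\nAnswer: {a}\n"
        "Rate 0-5 for relevance and 0-5 for groundedness in context. "
        'Respond with JSON only, using the keys "relevance" and "groundedness".'
    )
    # invoke() returns a chat message; .content holds the text reply.
    return llm.invoke(prompt).content

# Retrieve the context and the answer for the same query so the judge sees both.
retrieved_docs = retriever.invoke(test_query)
context = "\n".join(d.page_content for d in retrieved_docs)
answer = qa.invoke({"query": test_query})["result"]

print(judge_answer(test_query, context, answer))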
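The judge returns free text, so comparing variants numerically requires parsing its reply. The sketch below is one defensive way to do that, assuming the reply contains a JSON object with `relevance` and `groundedness` keys (an assumption: the judge prompt should ask for those keys explicitly). The `parse_judge_scores` helper and the sample reply are hypothetical.

```python
import json
import re

def parse_judge_scores(raw, keys=("relevance", "groundedness")):
    """Extract 0-5 scores from a judge reply that may wrap JSON in extra text."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    scores = {}
    for key in keys:
        value = data.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {key!r}: {value!r}")
        scores[key] = value
    return scores

sample_reply = 'Here is my rating: {"relevance": 5, "groundedness": 4}'
print(parse_judge_scores(sample_reply))
# → {'relevance': 5, 'groundedness': 4}
```

Validating the range matters because a model can return scores outside 0-5 or omit a key; raising an error is safer than silently recording a bad data point in a comparison.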

## Conclusion

You've built a baseline RAG pipeline with LangChain and extended it with corrective, speculative, and agentic patterns. Each variant trades off latency, accuracy, and complexity. Use the evaluation harness to measure these trade-offs on your own dataset and choose the pattern that best fits your use case. Next, explore reranking strategies or integrate external tools to further refine retrieval quality.