# Self-RAG

In this notebook, we implement a self-RAG pipeline — a retrieval-augmented generation approach that introduces decision-making and quality control steps into the generation workflow. Unlike standard RAG, which always retrieves documents and generates text based on them, self-RAG adds layers of introspection:
- Should retrieval be done at all?
- Are the retrieved documents relevant?
- Is the generated answer supported by the documents?
- Is the answer useful to the user?

This method leads to more controlled, relevant, and grounded responses, especially in real-world applications where quality and reliability matter.

In [1]:
import os
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.vectorstores import FAISS

# Load environment variables from a .env file
load_dotenv()

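# Set the OpenAI API key environment variable, failing early with a clear message
# if it is missing (assigning None to os.environ would raise an opaque TypeError)
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file or environment.")
os.environ["OPENAI_API_KEY"] = api_key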

#### Load and index the document into a vector store
To answer questions based on a document, we first need to turn it into a format our language model can search. Language models on their own don’t "know" the content of a PDF unless we explicitly make it available in a structured way. That is where embedding and vector search come in.

This process builds a memory-like structure from your document that can be semantically searched.

In [2]:
# Define the path to the PDF document we want to index
path = "Understanding_Climate_Change.pdf"

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load and parse the PDF document into a list of text objects (usually one per page)
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split the document into smaller chunks, with overlap to maintain semantic continuity
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Maximum number of characters per chunk
        chunk_overlap=chunk_overlap,  # Number of characters shared with the previous chunk
        length_function=len  # Use raw character count to determine length
    )
    texts = text_splitter.split_documents(documents)

    # Replace tab characters with spaces to normalize whitespace
    for doc in texts:
        doc.page_content = doc.page_content.replace('\t', ' ')

    # Convert text chunks into dense vectors using OpenAI's embedding model
    embeddings = OpenAIEmbeddings()

    # Store the resulting embeddings into a FAISS index for fast similarity search
    vectorstore = FAISS.from_documents(texts, embeddings)

    return vectorstore

# Build the vector store from the PDF
vectorstore = encode_pdf(path)

In this step, we:
- Load the PDF into memory as raw text
- Chunk the text into overlapping segments so that each chunk preserves context from its neighbors (this helps avoid cutting off important ideas mid-sentence)
- Embed each chunk using OpenAI's embeddings model, converting text into numerical vectors
- Store these vectors in a FAISS vector store, which allows for fast similarity searches based on user queries later
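To make the chunking step concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, which is the same idea the pipeline relies on to keep context from being lost at chunk boundaries.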

So later, when a user asks a question, the system can search semantically — even if the question uses different wording than the document — and fetch the most relevant chunks of text from the original PDF.
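The idea behind semantic search can be illustrated with a toy example. The sketch below ranks a few hand-written three-dimensional vectors by cosine similarity; the vectors, texts, and function names are all made up for illustration, real embeddings have far more dimensions, and the actual FAISS index may use a different metric (such as L2 distance) depending on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=3):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "index": (chunk text, made-up embedding)
toy_index = [
    ("Rising sea levels threaten coastal cities.", [0.9, 0.1, 0.0]),
    ("Greenhouse gases trap heat in the atmosphere.", [0.7, 0.3, 0.1]),
    ("The recipe calls for two cups of flour.", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]  # made-up embedding of a climate-related question

results = top_k(toy_query, toy_index, k=2)
print(results)
```

The unrelated recipe chunk points in a different direction in the vector space, so it is ranked last and dropped when `k=2`.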

#### Initialize the language model
We will use a lightweight OpenAI model to handle reasoning, classification, and generation throughout the pipeline.

In [3]:
llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18", max_tokens=1000, temperature=0)

We rely on this model not just for generating answers, but also for making structured decisions (like whether to retrieve or how good an answer is). Setting the temperature to 0 makes outputs as repeatable as possible, although identical results across runs are not strictly guaranteed.

### Designing structured prompts for the self-RAG pipeline
Self-RAG is not just about pulling information and generating text. It is a structured, self-reflective reasoning system — and to support that, we need a set of prompts that serve specific roles in the pipeline.

Each stage in the process — whether it is deciding if we even need to retrieve documents, checking whether retrieved content is relevant, generating a response, or evaluating how strong and useful that response is — requires a different kind of instruction to the language model.

This is where prompt templates come in. We carefully craft prompts that tell the model exactly what its job is at each step. But instead of treating the output like a blob of text, we define schemas (using `pydantic` models) to enforce structure — this makes the output reliable and machine-readable.

In [4]:
### Define specialized prompts and structured response schemas for each reasoning stage ###

# ---- Schema: Retrieval decision ----
# Decide if retrieval is needed
class RetrievalResponse(BaseModel):
    response: str = Field(..., title="Determines if retrieval is necessary", description="Output only 'Yes' or 'No'.")

retrieval_prompt = PromptTemplate(
    input_variables=["query"],
    template="Given the query '{query}', determine if retrieval is necessary. Output only 'Yes' or 'No'."
)


# ---- Schema: Relevance check ----
# Determines if a retrieved chunk is relevant to the user's query
class RelevanceResponse(BaseModel):
    response: str = Field(..., title="Determines if context is relevant", description="Output only 'Relevant' or 'Irrelevant'.")

relevance_prompt = PromptTemplate(
    input_variables=["query", "context"],
    template="Given the query '{query}' and the context '{context}', determine if the context is relevant. Output only 'Relevant' or 'Irrelevant'."
)


# ---- Schema: Response generation ----
# Generates an actual answer given the context and query
class GenerationResponse(BaseModel):
    response: str = Field(..., title="Generated response", description="The generated response.")

generation_prompt = PromptTemplate(
    input_variables=["query", "context"],
    template="Given the query '{query}' and the context '{context}', generate a response."
)


# ---- Schema: Support assessment ----
# Checks if the generated response is actually supported by the retrieved context
class SupportResponse(BaseModel):
    response: str = Field(..., title="Determines if response is supported", description="Output 'Fully supported', 'Partially supported', or 'No support'.")

support_prompt = PromptTemplate(
    input_variables=["response", "context"],
    template="Given the response '{response}' and the context '{context}', determine if the response is supported by the context. Output 'Fully supported', 'Partially supported', or 'No support'."
)


# ---- Schema: Utility rating ----
# Scores the usefulness of the response on a 1 to 5 scale
class UtilityResponse(BaseModel):
    response: int = Field(..., title="Utility rating", description="Rate the utility of the response from 1 to 5.")

utility_prompt = PromptTemplate(
    input_variables=["query", "response"],
    template="Given the query '{query}' and the response '{response}', rate the utility of the response from 1 to 5."
)


# ---- Create LLM chains for each step ----
# Each chain combines a template and a schema to ensure consistent LLM outputs
retrieval_chain = retrieval_prompt | llm.with_structured_output(RetrievalResponse)
relevance_chain = relevance_prompt | llm.with_structured_output(RelevanceResponse)
generation_chain = generation_prompt | llm.with_structured_output(GenerationResponse)
support_chain = support_prompt | llm.with_structured_output(SupportResponse)
utility_chain = utility_prompt | llm.with_structured_output(UtilityResponse)

- Each `PromptTemplate` defines how we talk to the language model: the phrasing, inputs, and instructions for each task. Each prompt is tailored to a single, specific task, making it easy to test and optimize independently.
- The corresponding `BaseModel` class (e.g., `RetrievalResponse`, `UtilityResponse`) defines the expected output format. This makes sure outputs are well-structured and easy to parse (no free-form text that we have to regex or guess).
- We combine the prompt with the model using `with_structured_output(...)`, which ensures that the model's raw response is parsed into a Python object with named fields (rather than unstructured text).
- The chaining (`prompt | llm.with_structured_output(...)`) composes the logic: first the prompt is filled with inputs, then the LLM processes it, and finally the output is parsed and validated using the schema.
- Each "chain" (like `retrieval_chain`) becomes a decision point we can invoke in the pipeline — pass in a query, get back a clean, validated result.

This gives us strong control over LLM behavior, making each component in self-RAG predictable, interpretable, and easy to debug.
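One caveat: the label fields in these schemas are typed as plain `str`, so a value outside the expected set (for example "Mostly supported") would still pass schema validation. A minimal sketch of an extra guard is shown below; `normalize_label` and `ALLOWED_SUPPORT` are hypothetical names, not part of the notebook or of LangChain.

```python
ALLOWED_SUPPORT = {"fully supported", "partially supported", "no support"}


def normalize_label(raw, allowed):
    """Trim whitespace and stray quotes/periods, lowercase, then reject unknown labels."""
    label = raw.strip().strip("'\".").strip().lower()
    if label not in allowed:
        raise ValueError(f"Unexpected label {raw!r}; expected one of {sorted(allowed)}")
    return label


print(normalize_label(" Fully Supported. ", ALLOWED_SUPPORT))

try:
    normalize_label("Mostly supported", ALLOWED_SUPPORT)
except ValueError as err:
    print(f"Rejected: {err}")
```

Failing loudly on an unexpected label is one design choice; mapping unknown values to the most conservative label would be another.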


### Defining the self-RAG logic flow
This is where everything comes together into an actual thinking pipeline. Based on the user query, the system dynamically determines the best course of action and selects the most trustworthy and useful response.

Each of these decisions happens through a structured reasoning chain, so the model is effectively self-regulating — questioning its own steps before answering.

In [5]:
def self_rag(query, vectorstore, top_k=3):
    print(f"\nProcessing query: {query}")

    # Step 1: Decide if we need external knowledge
    print("Step 1: Determining if retrieval is necessary...")
    input_data = {"query": query}
    retrieval_decision = retrieval_chain.invoke(input_data).response.strip().lower()
    print(f"Retrieval decision: {retrieval_decision}")

    if retrieval_decision == 'yes':
        # Step 2: Retrieve top-k similar chunks from the vector store
        print("Step 2: Retrieving relevant documents...")
        docs = vectorstore.similarity_search(query, k=top_k)
        contexts = [doc.page_content for doc in docs]
        print(f"Retrieved {len(contexts)} documents")

        # Step 3: Evaluate relevance of retrieved documents and filter out irrelevant chunks
        print("Step 3: Evaluating relevance of retrieved documents...")
        relevant_contexts = []
        for i, context in enumerate(contexts):
            input_data = {"query": query, "context": context}
            relevance = relevance_chain.invoke(input_data).response.strip().lower()
            print(f"Document {i+1} relevance: {relevance}")
            if relevance == 'relevant':
                relevant_contexts.append(context)

        print(f"Number of relevant contexts: {len(relevant_contexts)}")

        # If no relevant contexts found, generate without retrieval
        if not relevant_contexts:
            print("No relevant contexts found. Generating without retrieval...")
            input_data = {"query": query, "context": "No relevant context found."}
            return generation_chain.invoke(input_data).response

        # Step 4: Generate response using relevant contexts
        print("Step 4: Generating responses using relevant contexts...")
        responses = []
        for i, context in enumerate(relevant_contexts):
            print(f"Generating response for context {i+1}...")
            input_data = {"query": query, "context": context}
            response = generation_chain.invoke(input_data).response

            # Step 5: Evaluate how well the response is supported by its context
            print(f"Step 5: Assessing support for response {i+1}...")
            input_data = {"response": response, "context": context}
            support = support_chain.invoke(input_data).response.strip().lower()
            print(f"Support assessment: {support}")

            # Step 6: Score the usefulness of the response
            print(f"Step 6: Evaluating utility for response {i+1}...")
            input_data = {"query": query, "response": response}
            utility = int(utility_chain.invoke(input_data).response)
            print(f"Utility score: {utility}")

            # Collect response with its metadata
            responses.append((response, support, utility))

        # Choose the best response — prioritize strong support, then high utility
        print("Selecting the best response...")
        best_response = max(responses, key=lambda x: (x[1] == 'fully supported', x[2]))
        print(f"Best response support: {best_response[1]}, utility: {best_response[2]}")
        return best_response[0]
    else:
        # If no retrieval is needed, generate directly without context
        print("Generating without retrieval...")
        input_data = {"query": query, "context": "No retrieval necessary."}
        return generation_chain.invoke(input_data).response

- The function begins by calling the retrieval decision chain to determine whether any background knowledge is needed at all. This avoids unnecessary vector searches and keeps the system efficient.
- If retrieval is necessary, we query the vector store (FAISS) and grab the top-k similar chunks based on the query's embedding.
- We then screen these chunks using the relevance chain, only keeping those deemed genuinely relevant to the query.
- For each relevant context, the system generates a response — and critically, it doesn't stop there.
- Every response is passed through two evaluations:
  - Is it supported by the context it used?
  - How useful is it to the query (1–5)?
- Finally, the best response is selected and returned: fully supported responses are preferred, and the utility score breaks ties.
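The selection step can be tried in isolation on hand-written candidates. The sketch below is a variation rather than the notebook's exact sort key: it assumes a graded `SUPPORT_RANK` mapping (a hypothetical name) so that a partially supported answer outranks an unsupported one, with utility as the tiebreaker. The candidate tuples are invented for illustration.

```python
# Higher rank means stronger support; unknown labels fall back to 0
SUPPORT_RANK = {"fully supported": 2, "partially supported": 1, "no support": 0}


def pick_best(candidates):
    """Pick the (response, support, utility) tuple with the best support, then utility."""
    return max(candidates, key=lambda c: (SUPPORT_RANK.get(c[1], 0), c[2]))


candidates = [
    ("Answer A", "no support", 5),
    ("Answer B", "partially supported", 3),
    ("Answer C", "fully supported", 4),
    ("Answer D", "fully supported", 2),
]

best = pick_best(candidates)
print(best)
```

Because the key is a tuple, support level is compared first and utility only matters among candidates with the same support level, so a high-utility but unsupported answer never wins over a supported one.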

This is what makes self-RAG intelligent: it is not just pulling text — it is thinking through whether the context matters, evaluating its own answers, and justifying the final response it returns.


### Running Self-RAG — High vs. low relevance queries
Time to see it in action. We will test the self-RAG pipeline with two very different types of queries:
- A relevant query that clearly relates to the loaded document ("Understanding Climate Change").
- An off-topic query that has nothing to do with the content.

This helps validate that the system can:
- Retrieve and ground answers when useful context is available.
- Avoid hallucination and gracefully fall back when no relevant information exists.
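Before spending API calls, the branching itself can be checked offline. The sketch below is a simplified, hypothetical harness: `run_flow` follows the same three branches as the pipeline but takes plain callables in place of the LLM chains and the vector store, and it answers from the first relevant chunk instead of scoring every candidate.

```python
def run_flow(query, needs_retrieval, retrieve, is_relevant, generate):
    """Follow the same branches as the pipeline, using injected callables."""
    if not needs_retrieval(query):
        return "no_retrieval", generate(query, None)

    relevant = [chunk for chunk in retrieve(query) if is_relevant(query, chunk)]
    if not relevant:
        return "fallback", generate(query, None)

    # Simplification: use the first relevant chunk rather than scoring all candidates
    return "grounded", generate(query, relevant[0])


# Stub components standing in for the LLM chains and the vector store
def always(*args):
    return True


def never(*args):
    return False


def fake_retrieve(query):
    return ["chunk about glaciers", "chunk about sea levels"]


def fake_generate(query, context):
    return f"answer based on: {context}" if context else "answer from model knowledge"


grounded = run_flow("on-topic question", always, fake_retrieve, always, fake_generate)
fallback = run_flow("on-topic question", always, fake_retrieve, never, fake_generate)
skipped = run_flow("off-topic question", never, fake_retrieve, always, fake_generate)

print(grounded)
print(fallback)
print(skipped)
```

Returning the branch name alongside the answer makes it easy to assert which path a query took, independent of what the model actually wrote.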

#### Test case 1: On-topic query — High relevance
This is a straightforward test where the question should align well with the content of the climate change PDF.

In [6]:
# Query that matches the domain of the document
query = "What is the impact of climate change on the environment?"
# Run the self-RAG pipeline with this query
response = self_rag(query, vectorstore)

print("\nFinal response:")
print(response)


Processing query: What is the impact of climate change on the environment?
Step 1: Determining if retrieval is necessary...
Retrieval decision: yes
Step 2: Retrieving relevant documents...
Retrieved 3 documents
Step 3: Evaluating relevance of retrieved documents...
Document 1 relevance: relevant
Document 2 relevance: relevant
Document 3 relevance: relevant
Number of relevant contexts: 3
Step 4: Generating responses using relevant contexts...
Generating response for context 1...
Step 5: Assessing support for response 1...
Support assessment: fully supported
Step 6: Evaluating utility for response 1...
Utility score: 5
Generating response for context 2...
Step 5: Assessing support for response 2...
Support assessment: fully supported
Step 6: Evaluating utility for response 2...
Utility score: 5
Generating response for context 3...
Step 5: Assessing support for response 3...
Support assessment: fully supported
Step 6: Evaluating utility for response 3...
Utility score: 5
Selecting the best response...
[output truncated]

This query triggers retrieval: all three retrieved chunks pass the relevance check, and each yields a fully supported candidate with a utility score of 5. This demonstrates how self-RAG leans on the knowledge base when it makes sense to do so.


#### Test case 2: Off-topic query — Low/no relevance
Here we are asking a question about Harry Potter — something that’s obviously unrelated to climate science.

In [8]:
# Query that is unrelated to the climate change domain
query = "how did harry beat quirrell?"
# Run the self-RAG pipeline
response = self_rag(query, vectorstore)

print("\nFinal response:")
print(response)


Processing query: how did harry beat quirrell?
Step 1: Determining if retrieval is necessary...
Retrieval decision: no
Generating without retrieval...

Final response:
Harry Potter defeated Professor Quirrell in the first book, "Harry Potter and the Sorcerer's Stone," through a combination of luck and the protective magic of his mother's sacrifice. Quirrell was trying to steal the Sorcerer's Stone for Voldemort, who was possessing him. When Quirrell attempted to touch Harry, he was unable to do so because of the love and protection that Harry's mother, Lily Potter, had bestowed upon him by sacrificing her life to save him. This protection caused Quirrell great pain and ultimately led to his defeat, allowing Harry to prevent Voldemort from obtaining the Stone.


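This query is off-topic. In this run the retrieval decision is "no", so self-RAG skips the vector store entirely and answers from the model's own knowledge instead of forcing unrelated climate passages into the response. Keep in mind that such an answer is not grounded in the indexed document. If retrieval had been attempted, the relevance check would be expected to filter out the climate chunks and trigger the same fallback to generation without context.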