# Corrective RAG (CRAG) using local LLMs

Corrective-RAG (CRAG) is a strategy for RAG that incorporates self-reflection / self-grading on retrieved documents.

The logic follows this general flow:

- If at least one document exceeds the relevance threshold, it proceeds to generation.
- If all documents fall below the relevance threshold, or if the grader is unsure, it uses web search to supplement retrieval.
- Before generation, it performs knowledge refinement on the retrieved or searched documents:
  - It partitions each document into "knowledge strips".
  - It grades each strip and filters out the irrelevant ones.
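
The knowledge refinement step described above can be sketched in a few lines. This is only an illustration under stated assumptions: strips are taken to be sentences, and `keyword_grader` is a hypothetical stand-in for an LLM grader that marks a strip as relevant when it shares a word with the question. The names `split_into_strips`, `keyword_grader` and `refine` are not from any library.

```python
import re
from typing import Callable, List


def split_into_strips(text: str) -> List[str]:
    """Partition a document into sentence-level "knowledge strips" (assumed granularity)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def keyword_grader(question: str, strip: str) -> bool:
    """Stand-in grader (assumption): relevant if the strip shares a word with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))
    s_words = set(re.findall(r"\w+", strip.lower()))
    return bool(q_words & s_words)


def refine(question: str, text: str,
           grader: Callable[[str, str], bool] = keyword_grader) -> str:
    """Keep only the strips the grader marks as relevant and rejoin them."""
    strips = split_into_strips(text)
    return " ".join(s for s in strips if grader(question, s))


doc = "Agents use memory to recall facts. The weather is nice. Memory can be short-term."
print(refine("agent memory", doc))
# Agents use memory to recall facts. Memory can be short-term.
```

In a real pipeline the `grader` argument would presumably wrap an LLM call, so the same `refine` function could be reused as a graph node.
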

We will implement some of these ideas from scratch using LangGraph:

- If any documents are irrelevant, we'll supplement retrieval with web search.
- We'll skip the knowledge refinement, but this can be added back as a node if desired.
- We'll use Tavily Search for web search.
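
Before wiring anything into LangGraph, the simplified routing rule above can be expressed as plain Python. This is a minimal sketch, not the graph itself: the node names `"web_search"` and `"generate"` and the `is_relevant` callback are assumptions made for illustration, and treating an empty retrieval result as a reason to search the web is an extra assumption.

```python
from typing import Callable, List, Tuple


def grade_documents(question: str, documents: List[str],
                    is_relevant: Callable[[str, str], bool]) -> Tuple[List[str], bool]:
    """Return the relevant documents and a flag saying whether to run web search."""
    kept = [d for d in documents if is_relevant(question, d)]
    # Simplified rule: any irrelevant document (or no documents at all) triggers web search.
    needs_web_search = len(kept) < len(documents) or not kept
    return kept, needs_web_search


def decide_next_step(needs_web_search: bool) -> str:
    """Name of the next (hypothetical) node; in a graph this would back a conditional edge."""
    return "web_search" if needs_web_search else "generate"


# Toy relevance check: the question must appear verbatim in the document.
toy = lambda q, d: q.lower() in d.lower()
kept, flag = grade_documents("memory", ["Agent memory types", "Cooking pasta"], toy)
print(kept, decide_next_step(flag))
# ['Agent memory types'] web_search
```
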

In [1]:
import getpass 
import os 
from dotenv import load_dotenv

load_dotenv()

GROK_API_KEY = os.getenv("GROK_API_KEY")
TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
LANGSMITH_API_KEY = os.getenv("LANGSMITH_API_KEY")
HG_FACE_API_KEY = os.getenv("HG_FACE_API_KEY")
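
If a key is missing from `.env`, `os.getenv` returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of a fail-early check follows, assuming a hypothetical helper named `require_env` (it is not part of `dotenv` or LangChain) that falls back to an interactive `getpass` prompt:

```python
import getpass
import os


def require_env(name: str, prompt: bool = True) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper: if the variable is unset or empty, either ask for it
    interactively (and store it in os.environ) or raise an error.
    """
    value = os.environ.get(name)
    if value:
        return value
    if not prompt:
        raise RuntimeError(f"Environment variable {name} is not set")
    value = getpass.getpass(f"{name}: ")
    os.environ[name] = value
    return value
```

For example, `TAVILY_API_KEY = require_env("TAVILY_API_KEY")` could stand in for the bare `os.getenv` call above.
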

In [2]:
import ollama 
available_models = ollama.list()
available_models

ListResponse(models=[Model(model='wizardlm2:latest', modified_at=datetime.datetime(2025, 3, 1, 23, 49, 4, 295170, tzinfo=TzInfo(+03:00)), digest='c9b1aff820f245a43b1719b296ce7131746073691cbb56f7a8d88a4713a3df79', size=4108928625, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='7B', quantization_level='Q4_0')), Model(model='mxbai-embed-large:latest', modified_at=datetime.datetime(2025, 3, 1, 23, 43, 45, 523887, tzinfo=TzInfo(+03:00)), digest='468836162de7f81e041c43663fedbbba921dcea9b9fefea135685a39b2d83dd8', size=669615493, details=ModelDetails(parent_model='', format='gguf', family='bert', families=['bert'], parameter_size='334M', quantization_level='F16')), Model(model='nomic-embed-text:latest', modified_at=datetime.datetime(2025, 2, 25, 15, 1, 57, 368958, tzinfo=TzInfo(+03:00)), digest='0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f', size=274302450, details=ModelDetails(parent_model='', format='gguf', family=

In [3]:
# local_llm = "llama3.2:latest"
local_llm = "qwen2.5:latest"
model_tested = local_llm  # keep the metadata in sync with the model actually used
metadata = f"CRAG, {model_tested}"

# Creating the index (vector store)

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import SKLearnVectorStore, Chroma
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings


# URL list to scrape
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

# Load the documents from each URL
docs = [WebBaseLoader(url).load() for url in urls]
docs_lists = [item for sublist in docs for item in sublist]  # flatten the list of lists

# Initialize a text splitter; chunk size and overlap are measured in tiktoken tokens
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,
    chunk_overlap=0,
)

# Split the documents into smaller chunks
doc_splits = text_splitter.split_documents(docs_lists)
print(f"Number of documents: {len(docs_lists)}")
print(f"Number of splits: {len(doc_splits)}")
# print(f"First split: {doc_splits[0].page_content}")
# print(f"First document: {docs_lists[0].page_content}")

# Embeddings 

embedding = OllamaEmbeddings(model="chroma/all-minilm-l6-v2-f32:latest")

vectorstore = Chroma.from_documents(
    documents = doc_splits,
    collection_name = "rag_chroma",
    embedding = embedding
)


# Retriever using maximal marginal relevance (MMR), returning 5 chunks per query
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5},
)

Number of documents: 3
Number of splits: 194
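
The retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades off relevance to the query against redundancy among the chunks already selected. The following pure-Python sketch illustrates the selection idea only; it is not LangChain's implementation, and the function names and toy vectors are made up for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: Sequence[float], candidates: List[Sequence[float]],
        k: int, lambda_mult: float = 0.5) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
chunks = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # chunks 0 and 1 are near-duplicates
print(mmr(query, chunks, k=2))                   # [0, 2]
print(mmr(query, chunks, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is a plain top-k by similarity; lower values favor diversity, which is why the near-duplicate chunk is skipped in the first call.
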


In [43]:
question = "agent memory"

# Retrieve the chunks most relevant to the question
docs = retriever.invoke(question)
print(docs)

[Document(metadata={'title': "LLM Powered Autonomous Agents | Lil'Log", 'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemory\n\nShort-term memory: I would consider all the in
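
As the output above shows, each retrieved `Document` carries a large `metadata` dictionary in addition to its text. When documents are interpolated into a prompt, it is often cleaner to pass only the text. A minimal sketch follows, assuming a hypothetical `format_docs` helper; the `Doc` dataclass is just a stand-in so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in (assumption) for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc]) -> str:
    """Join only the text of each document, separated by blank lines."""
    return "\n\n".join(d.page_content for d in docs)


sample = [Doc("Short-term memory is in-context learning."),
          Doc("Long-term memory uses a vector store.", {"title": "Lil'Log"})]
print(format_docs(sample))
```

Because the helper only reads the `page_content` attribute, it should also accept the objects returned by `retriever.invoke(...)`.
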

In [45]:
# Define Tools
### Retrieval Grader 

from langchain.prompts import PromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import JsonOutputParser

llm = ChatOllama(model=local_llm, format="json", temperature=0)

# Prompt 
prompt = PromptTemplate(
    template="""You are a teacher grading a quiz. You will be given: 
    1/ a QUESTION
    2/ A FACT provided by the student
    
    You are grading RELEVANCE RECALL:
    A score of 1 means that ANY of the statements in the FACT are relevant to the QUESTION. 
    A score of 0 means that NONE of the statements in the FACT are relevant to the QUESTION. 
    1 is the highest (best) score. 0 is the lowest score you can give. 
    
    Explain your reasoning in a step-by-step manner. Ensure your reasoning and conclusion are correct. 
    
    Avoid simply stating the correct answer at the outset.
    
    Question: {question} \n
    Fact: \n\n {documents} \n\n
    
    Give a binary score of '1' or '0' to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
    """,
    input_variables=["question", "documents"],
)

retrieval_grader = prompt | llm | JsonOutputParser()
question = "agent memory"

docs = retriever.invoke(question)

print(retrieval_grader.invoke({"question": question, "documents": docs}))


{'score': 1}
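
Note that the prompt asks for `'1'` or `'0'`, while the run above returned the integer `1`; local models may vary in how they encode the score. A small, hedged sketch of a normalizer is shown below. The helper name `is_relevant` and the accepted spellings are assumptions, and anything unrecognized is treated as not relevant, in the spirit of falling back to web search when the grader is unsure.

```python
from typing import Any, Dict


def is_relevant(grade: Dict[str, Any]) -> bool:
    """Interpret the grader's parsed JSON output as a boolean.

    Accepts 1, 1.0, "1", "yes" and "true"; anything else, including a
    missing 'score' key, counts as not relevant.
    """
    score = grade.get("score")
    if isinstance(score, bool):
        return score
    if isinstance(score, (int, float)):
        return score == 1
    if isinstance(score, str):
        return score.strip().lower() in {"1", "yes", "true"}
    return False


print(is_relevant({"score": 1}), is_relevant({"score": "0"}), is_relevant({}))
# True False False
```
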


In [46]:
### Generate 
from langchain_core.output_parsers import StrOutputParser

# Prompt 
prompt = PromptTemplate(
    template = """You are an assistant for question-answering tasks. 
    
    Use the following documents to answer the question. 
    
    If you don't know the answer, just say that you don't know. 
    
    Use three sentences maximum and keep the answer concise:
    Question: {question} 
    Documents: {documents} 
    Answer: 
    """,
    input_variables=["question", "documents"],
)

# LLM 
llm = ChatOllama(model=local_llm, temperature=0)

# Chain 
rag_chain = prompt | llm | StrOutputParser()

# Run the chain

generation = rag_chain.invoke({"documents": docs, "question": question})
print(generation)

Based on the information provided, it seems you are discussing an LLM-powered autonomous agent system and its components, particularly focusing on planning, memory, and tool use.

### Planning:
- **Subgoal Decomposition**: The agent breaks down complex tasks into smaller, manageable subgoals to handle complexity efficiently.
- **Self-Criticism and Reflection**: The agent can evaluate past actions, learn from mistakes, and refine future steps to improve overall performance.

### Memory:
- **Short-Term Memory (ST):** Utilizes in-context learning or prompt engineering techniques where the model learns from immediate inputs and context.
- **Long-Term Memory (LT):** Employs external vector stores for retaining and retrieving vast amounts of information over extended periods, enabling the agent to recall historical data and knowledge.

### Tool Use:
- The agent can access external APIs to gather necessary information not available in its pre-trained model weights. This includes current data,