# Simple Fully Local RAG Agents Using LangChain with LLaMA3.2

The goal is to build a fully local RAG (Retrieval-Augmented Generation) system that uses LLaMA3.2 as the LLM.
In this notebook we add two checks to each response: whether the LLM hallucinated, and whether the response actually answers the question.

In [1]:
import json
import os

import chromadb
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv("../../.env.research")

True

In [2]:
### LLM
local_llm = "llama3.2:latest"
local_llm_json = "llama3.2:3b-instruct-fp16"
llm = ChatOllama(model=local_llm, temperature=0.1)
llm_json_mode = ChatOllama(model=local_llm_json, temperature=0, format="json")

In [3]:
def load_pdfs_from_directory(directory_path):
    all_documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(directory_path, filename)
            loader = PDFPlumberLoader(file_path=file_path)
            documents = loader.load()
            all_documents.extend(documents)
    return all_documents
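
`os.listdir` returns entries in arbitrary order, and `endswith(".pdf")` is case-sensitive, so a file named `rules.PDF` would be skipped. A minimal sketch of a stricter file discovery step, assuming a hypothetical helper `find_pdfs` (not part of this notebook) whose results could then be passed to `PDFPlumberLoader`:

```python
import tempfile
from pathlib import Path


def find_pdfs(directory_path: str) -> list:
    """Return PDF files in a directory, sorted, matching the extension case-insensitively."""
    return sorted(
        path
        for path in Path(directory_path).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demo on a throwaway directory with placeholder (empty) files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [path.name for path in find_pdfs(tmp)]

print(found)
```

Sorting keeps the ingestion order stable between runs, which makes the resulting chunks easier to compare.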

In [4]:
client = chromadb.HttpClient(
    host=os.getenv("CHROMA_HOST"), port=int(os.getenv("CHROMA_PORT"))
)
vectorstore = None
if os.getenv("CHROMA_COLLECTION_NAME") not in [
    collection.name for collection in client.list_collections()
]:
    client.create_collection(os.getenv("CHROMA_COLLECTION_NAME"))
    pdf_docs = load_pdfs_from_directory(os.getenv("DATA_DIR"))
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
    all_splits = text_splitter.split_documents(pdf_docs)
    vectorstore = Chroma.from_documents(
        documents=all_splits,
        embedding=OllamaEmbeddings(model="nomic-embed-text"),
        collection_name=os.getenv("CHROMA_COLLECTION_NAME"),
        client=client,
    )
else:
    vectorstore = Chroma(
        collection_name=os.getenv("CHROMA_COLLECTION_NAME"),
        client=client,
        embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
    )

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
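
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window illustration. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper used only to show how consecutive chunks share text:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With the notebook's settings (`chunk_size=1500`, `chunk_overlap=200`), the same idea means neighbouring chunks can share up to 200 characters, so a rule that straddles a chunk boundary is less likely to be cut in half.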

In [5]:
### Router

# Prompt
router_instructions = """You are a routing assistant that determines the best data source to answer user questions. There are two options:

Vectorstore: Use this for **specific and detailed questions** about the rules, setup, or gameplay of Rummikub.
Irrelevant: Use this for **questions unrelated to Rummikub**, including general questions about unrelated topics, technologies, or concepts, and questions whose relevance is uncertain.

**Output Format:**
Respond strictly in the following JSON format:

{"datasource": "<vectorstore | irrelevant>"}

**Guidelines:**
1. Use `vectorstore` **only** for questions directly referencing Rummikub rules, setup, or gameplay. Examples:
   - "What are the rules for forming sets in Rummikub?"
   - "How many tiles does each player start with?"
2. Use `irrelevant` for all other questions, including general knowledge or unrelated topics. Examples:
   - "What are the main causes of global warming?"
   - "What are the types of agent memory?"

Always prioritize the most accurate category.
"""
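
The router is asked to reply with `{"datasource": ...}`, but a small model can still return a different key or an unexpected value. A minimal sketch of a defensive parser, assuming a hypothetical helper `parse_route`; falling back to `irrelevant` is an assumption that mirrors the prompt's instruction to treat uncertain questions as irrelevant:

```python
import json

ALLOWED_DATASOURCES = {"vectorstore", "irrelevant"}


def parse_route(raw_content: str, fallback: str = "irrelevant") -> str:
    """Extract a valid datasource from the router's raw JSON reply, or return the fallback."""
    try:
        payload = json.loads(raw_content)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(payload, dict):
        return fallback
    datasource = str(payload.get("datasource", "")).strip().lower()
    return datasource if datasource in ALLOWED_DATASOURCES else fallback


print(parse_route('{"datasource": "Vectorstore"}'))
print(parse_route("not json at all"))
```

A helper like this could be applied to the `.content` of the routing response in place of a bare `json.loads` plus key lookup.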

In [6]:
### Generate

# Prompt
rag_prompt = """You are an assistant for question-answering tasks.

Here is the context to use to answer the question:

{context}

Carefully analyze the context above.

Now, review the user question:

{question}

Provide a clear, direct, and concise answer to the user's question, using only the information from the context. Avoid repeating long excerpts from the context verbatim unless necessary to clarify your response.

Make sure the answer is well-structured, easy to understand, and directly addresses the user's query.

Answer:"""

rag_template = PromptTemplate(
    template=rag_prompt, input_variables=["context", "question"]
)

In [7]:
### Hallucination Grader

# Hallucination grader instructions
hallucination_grader_instructions = """

You are a teacher grading a quiz.

You will be given FACTS and a STUDENT ANSWER.

Here is the grade criteria to follow:

(1) Ensure the STUDENT ANSWER is grounded in the FACTS.

(2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Score:

A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.

A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
"""

# Grader prompt
hallucination_grader_prompt_template = """FACTS: \n\n {documents} \n\n STUDENT ANSWER: {generation}.

Return JSON with a single key, binary_score, whose value is 'yes' or 'no' to indicate whether the STUDENT ANSWER is grounded in the FACTS."""

In [8]:
### Answer Grader

# Answer grader instructions
answer_grader_instructions = """You are a teacher grading a quiz.

You will be given a QUESTION and a STUDENT ANSWER.

Here is the grade criteria to follow:

(1) The STUDENT ANSWER helps to answer the QUESTION

Score:

A score of yes means that the student's answer meets all of the criteria. This is the highest (best) score.

The student can receive a score of yes if the answer contains extra information that is not explicitly asked for in the question.

A score of no means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
"""

# Grader prompt
answer_grader_prompt_template = """QUESTION: \n\n {question} \n\n STUDENT ANSWER: {generation}.

Return JSON with a single key, binary_score, whose value is 'yes' or 'no' to indicate whether the STUDENT ANSWER meets the criteria."""
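
Both graders return a `binary_score`, and the calling code compares it to the exact string `"yes"`. A hedged sketch of a more tolerant check, assuming a hypothetical helper `is_pass` that takes the already-parsed JSON dictionary and treats anything other than a clear yes as a failure:

```python
def is_pass(grader_payload: dict) -> bool:
    """Interpret a grader reply such as {"binary_score": "yes"} as a boolean verdict."""
    score = grader_payload.get("binary_score")
    # Some models answer with a JSON boolean instead of a string.
    if isinstance(score, bool):
        return score
    return str(score).strip().lower() == "yes"


print(is_pass({"binary_score": "Yes"}))
print(is_pass({}))
```

Treating a missing key as a failure means a malformed grader reply triggers another attempt instead of raising a `KeyError`.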

In [9]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
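
As a quick check of what `format_docs` produces, here is a small usage sketch. `FakeDoc` is a hypothetical stand-in for the document objects returned by the retriever (only the `page_content` attribute matters here), and the texts are placeholders:

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook cell: join chunk texts with a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("first chunk"), FakeDoc("second chunk")])
print(context)
```

The blank line between chunks gives the model a visible boundary between retrieved passages in the `{context}` slot of the prompt.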

In [10]:
### Chains
rag_chain = rag_template | llm | StrOutputParser()

In [11]:
def rag(question: str, max_retries: int = 3):
    # Route the question
    print("---ROUTING---")

    json_routing_response = llm_json_mode.invoke(
        [SystemMessage(content=router_instructions), HumanMessage(content=question)]
    )
    routing_response = json.loads(json_routing_response.content)
    datasource = routing_response["datasource"]

    if datasource == "vectorstore":
        # Retrieve documents
        print("---RETRIEVING---")
        documents = retriever.invoke(question)
        context = format_docs(documents)
        loop_step = 0
        response = None

        while loop_step < max_retries:
            # Answer the question
            print("---GENERATING---")
            response = rag_chain.invoke({"question": question, "context": context})
            # Grade the hallucination
            print("---GRADING HALLUCINATION---")
            hallucination_formatted_prompt = (
                hallucination_grader_prompt_template.format(
                    documents=context,
                    generation=response,
                )
            )
            json_hallucination_grader_response = llm_json_mode.invoke(
                [SystemMessage(content=hallucination_grader_instructions)]
                + [HumanMessage(content=hallucination_formatted_prompt)]
            )
            hallucination_grader_response = json.loads(
                json_hallucination_grader_response.content
            )

            # Grade the answer
            print("---GRADING ANSWER---")
            answer_formatted_prompt = answer_grader_prompt_template.format(
                question=question,
                generation=response,
            )
            json_answer_grader_response = llm_json_mode.invoke(
                [SystemMessage(content=answer_grader_instructions)]
                + [HumanMessage(content=answer_formatted_prompt)]
            )
            answer_grader_response = json.loads(json_answer_grader_response.content)

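            if (
                hallucination_grader_response["binary_score"] == "yes"
                and answer_grader_response["binary_score"] == "yes"
            ):
                print("---SUCCESS---")
                break
            # Count this attempt so the loop stops after max_retries tries
            # instead of regenerating forever when a grader keeps failing.
            loop_step += 1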
        return response
    return "Question is not related to Rummikub"
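
The generate-then-grade pattern can also be isolated from the model calls, which makes the retry logic easy to test. A minimal sketch with stubbed callables, assuming a hypothetical helper `generate_with_checks`; in the notebook, `generate` would wrap `rag_chain.invoke` and each check would wrap one grader call:

```python
from typing import Callable, Optional, Sequence, Tuple


def generate_with_checks(
    generate: Callable[[], str],
    checks: Sequence[Callable[[str], bool]],
    max_retries: int = 3,
) -> Tuple[Optional[str], bool]:
    """Generate up to max_retries times; return the last answer and whether it passed all checks."""
    answer = None
    for _ in range(max_retries):
        answer = generate()
        if all(check(answer) for check in checks):
            return answer, True
    # Retries exhausted: hand back the last attempt, flagged as unverified.
    return answer, False


# Stubbed run: the first attempt fails the check, the second passes.
attempts = iter(["draft answer", "grounded answer"])
result = generate_with_checks(
    generate=lambda: next(attempts),
    checks=[lambda answer: answer.startswith("grounded")],
)
print(result)  # ('grounded answer', True)
```

Returning the pass flag alongside the answer lets the caller decide what to do with a response that never passed the graders.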

In [12]:
rag("What is the capital of France?")

---ROUTING---


'Question is not related to Rummikub'

In [13]:
rag("What do I use a joker for?")

---ROUTING---
---RETRIEVING---
---GENERATING---
---GRADING HALLUCINATION---
---GRADING ANSWER---
---GENERATING---
---GRADING HALLUCINATION---
---GRADING ANSWER---
---SUCCESS---


'A joker must be used immediately as part of a meld on the table when it is picked up, and its use is restricted by the specific rules being followed. The primary purpose of using a joker is to complete or enhance an existing meld, rather than storing it for future use.'

In [14]:
rag("How can I win?")

---ROUTING---
---RETRIEVING---
---GENERATING---
---GRADING HALLUCINATION---
---GRADING ANSWER---
---GENERATING---


KeyboardInterrupt: 

In [15]:
rag("What is manipulation in Rummikub?")

---ROUTING---
---RETRIEVING---
---GENERATING---
---GRADING HALLUCINATION---
---GRADING ANSWER---
---SUCCESS---


"Manipulation in Rummikub refers to the process of rearranging or adding tiles to existing sets on the table to create new sets. This can be done by adding one or more tiles from the rack to an existing set, removing a tile from a group and using it to form a new set, splitting a run into two separate groups, or combining two groups into one. The goal of manipulation is to create legitimate sets that meet the game's rules, while also trying to table the greatest amount of tiles possible."

# Conclusion
This approach is easy to use and includes checks, but it is hard to debug and to see which path the chain took. That is why LangGraph is much nicer to use: you can see the inputs and outputs of each step.
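
Within plain Python, some of that visibility can be approximated by recording each step as it runs. A minimal sketch, assuming a hypothetical `StepTrace` helper (this is not LangGraph and not part of LangChain); the steps below are stubs standing in for the routing and generation calls:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepTrace:
    """Records the name, inputs, and output of every step that is run through it."""
    steps: list = field(default_factory=list)

    def run(self, name: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        output = fn(*args, **kwargs)
        self.steps.append(
            {"step": name, "args": args, "kwargs": kwargs, "output": output}
        )
        return output

    def path(self) -> list:
        # The ordered list of step names shows which branch was taken.
        return [entry["step"] for entry in self.steps]


trace = StepTrace()
route = trace.run("route", lambda question: "vectorstore", "What do I use a joker for?")
answer = trace.run("generate", str.upper, "stub answer")
print(trace.path())  # ['route', 'generate']
```

Inspecting `trace.steps` after a run shows the inputs and outputs of each step, replacing the bare `print("---ROUTING---")` markers with something that can be examined afterwards.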