Project Introduction

This project implements a lightweight AI framework that automatically searches and validates document content using Retrieval-Augmented Generation (RAG).
It is designed for corporate environments where large PDFs, policies, procedures, and technical documents must be checked for the presence of specific requirements or compliance criteria.

The framework provides:

- Document ingestion from PDF, DOCX, PPTX
- Chunking + embedding using OpenAI or local models (Ollama)
- Persistent vector database built with Chroma
- RAG-based question answering grounded strictly on retrieved context
- Requirement validation module that returns yes / no with justification
- Hallucination mitigation via context-supported verification
- CLI and Flask API interfaces for interaction and integration

The goal is to demonstrate how modern LLMs + semantic search can automate manual document review tasks and enable consistent, auditable compliance checking.

# 0. Environment Setup

Before running any part of this RAG Document Validation framework, we need to
install all required Python libraries. These libraries support:

- LLM providers (OpenAI, Mistral, Ollama via LangChain)
- Vector stores (FAISS, Chroma, Pinecone)
- RAG pipeline tools (LangChain, LangGraph, LlamaIndex)
- Evaluation frameworks (RAGAS, datasets, evaluate)
- API server (Flask)
- Environment variable loading (python-dotenv)

The project includes a `requirements.txt` file containing all needed
dependencies. We install them using:



In [None]:
# Install project requirements
#!pip install -r requirements.txt



In [None]:
env_content = """
# ---- LLM Provider Settings ----
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3

# ---- Embeddings ----
# Use FREE local embeddings via Ollama
EMBED_PROVIDER=ollama
EMBED_MODEL=nomic-embed-text

# ---- Vector Store Settings ----
VECTOR_DB=chroma
COLLECTION=docval

# ---- API Keys ----
# All empty because we use NO paid services
OPENAI_API_KEY=
MISTRAL_API_KEY=
PINECONE_API_KEY=
PINECONE_INDEX=docval
"""

with open(".env", "w") as f:
    f.write(env_content)

print("Created FREE .env file successfully.")


Created FREE .env file successfully.
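
To make the file format concrete, here is a minimal sketch of how such `KEY=VALUE` settings can be read back. `parse_env_text` is a hypothetical helper written only for illustration. The project itself relies on `python-dotenv`, which also handles quoted values and other syntax that this sketch ignores.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# A shortened copy of the .env content written above.
sample = """
# ---- LLM Provider Settings ----
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3
OPENAI_API_KEY=
"""

settings = parse_env_text(sample)
print(settings)
```

Empty values such as `OPENAI_API_KEY=` come back as empty strings, which is why the later code passes explicit defaults to `os.getenv`.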


## 1. Imports, paths and environment

In this section I import all the libraries and configure the base paths.

- `os`, `argparse`, `json`, `datetime`: standard Python utilities.
- `dotenv.load_dotenv`: loads environment variables from the `.env` file (API keys, model names, etc.).
- LangChain loaders and tools:
  - `Docx2txtLoader`, `PyPDFLoader`, `UnstructuredPowerPointLoader` for reading `.docx`, `.pdf`, `.pptx`.
  - `RecursiveCharacterTextSplitter` for splitting long texts into chunks.
  - `Document` as the basic text+metadata container.
- Embeddings and vector stores:
  - `OpenAIEmbeddings` to convert text chunks into vectors (the default configuration in this notebook uses `OllamaEmbeddings` instead).
  - `Chroma`, `FAISS`, `PineconeVectorStore` as vector DB backends.
  - `Pinecone` client for managed Pinecone indexes.
- Later in the file I also use:
  - LLM chat models (OpenAI / Mistral / Ollama).
  - `ChatPromptTemplate`, `MessagesPlaceholder`, `HumanMessage` to build prompts.
  - `Flask` for the HTTP API.

Then I define:
- `BASE_DIR`: the current working directory (`os.getcwd()`), expected to be the project folder.
- `DATA_DIR`: directory where the raw documents to ingest are stored.
- `VECTOR_DIR`: directory where the vector indexes (Chroma/FAISS) are stored.
- `load_dotenv(...)`: loads `.env` from the project root so the code can read configuration from environment variables.


In [None]:
import os
import argparse
import json
from datetime import datetime

from dotenv import load_dotenv

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, UnstructuredPowerPointLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

# if you have more imports (Chat models, prompts, Flask, etc.) paste them here too

BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "data")
VECTOR_DIR = os.path.join(BASE_DIR, "vector")

load_dotenv(os.path.join(BASE_DIR, ".env"))


True

## 2. LLM selection (`get_chat_model`)

This function decides which large language model backend to use, based on environment variables.

Supported providers:
- **OpenAI GPT** (`LLM_PROVIDER=openai`): hosted option with good general performance; requires an API key.
- **Mistral** (`LLM_PROVIDER=mistral`): hosted Mistral models, often cheaper and EU-friendly; requires an API key.
- **Llama 3 via Ollama** (`LLM_PROVIDER=ollama`): local or self-hosted option (e.g. llama3). This is the default when `LLM_PROVIDER` is unset or unrecognized.

Environment variables:
- `LLM_PROVIDER`: which backend to use (`openai`, `mistral`, `ollama`).
- `CHAT_MODEL`: main model name for OpenAI (e.g. `gpt-4o-mini`).
- `MISTRAL_MODEL`: model name for Mistral (e.g. `mistral-large-latest`).
- `OLLAMA_MODEL`: model name for Ollama (e.g. `llama3`).

The idea: the rest of the code just calls `get_chat_model()` and doesn’t care which provider is behind it.
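
The environment-variable lookup described above can be sketched without installing any LLM package. This is only an illustration: `PROVIDER_MODELS` and `resolve_llm_settings` are hypothetical names, and the function returns plain strings rather than a chat model object.

```python
import os

# Hypothetical table: provider -> (env var holding the model name, default model).
PROVIDER_MODELS = {
    "openai": ("CHAT_MODEL", "gpt-4o-mini"),
    "mistral": ("MISTRAL_MODEL", "mistral-large-latest"),
    "ollama": ("OLLAMA_MODEL", "llama3"),
}


def resolve_llm_settings(env=None):
    """Return (provider, model) from an environment-like mapping."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama").lower()
    if provider not in PROVIDER_MODELS:
        # Unknown provider names fall back to the local Ollama default.
        return "ollama", "llama3"
    var, default = PROVIDER_MODELS[provider]
    return provider, env.get(var, default)


print(resolve_llm_settings({"LLM_PROVIDER": "mistral"}))
```

Passing a plain dictionary instead of `os.environ` makes the selection logic easy to test in isolation.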


In [None]:
from langchain_openai import ChatOpenAI
from langchain_mistralai import ChatMistralAI
from langchain_community.chat_models import ChatOllama

def get_chat_model():
    provider = os.getenv("LLM_PROVIDER", "ollama").lower()
    # CHAT_MODEL is only used for OpenAI
    model = os.getenv("CHAT_MODEL", "gpt-4o-mini")

    # ---- OpenAI (requires API key) ----
    if provider == "openai":
        return ChatOpenAI(
            model=model,
            temperature=0,
        )

    # ---- Mistral (requires API key) ----
    if provider == "mistral":
        return ChatMistralAI(
            model=os.getenv("MISTRAL_MODEL", "mistral-large-latest"),
            temperature=0,
        )

    # ---- Ollama (FREE, local) ----
    if provider == "ollama":
        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            temperature=0,
        )

    # ---- fallback: Ollama (free) ----
    return ChatOllama(model="llama3", temperature=0)


In [None]:
llm = get_chat_model()
llm




ChatOllama(model='llama3', temperature=0.0)

## 3. Vector store selection (`get_retriever`)

This function chooses which vector database backend to use for retrieval, and returns a LangChain `retriever`.

Supported vector DBs:
- **Chroma** (`VECTOR_DB=chroma`)
  - Local on-disk vector DB.
  - Good for a single machine with persistence.
- **FAISS** (`VECTOR_DB=faiss`)
  - In-memory index that is saved to and loaded from local disk.
  - Good for experiments and small/medium projects.
- **Pinecone** (`VECTOR_DB=pinecone`)
  - Managed cloud vector DB.
  - Good for scalable / production deployments.

Steps:
1. Create an embedding model via `get_embeddings()` (Ollama or OpenAI, chosen by `EMBED_PROVIDER` and `EMBED_MODEL`).
2. Read `VECTOR_DB` env to decide backend.
3. For each backend:
   - Open or create the index (Chroma/FAISS/Pinecone).
   - Wrap it as a `retriever` with `k` nearest neighbors (and for Chroma use MMR).
4. Return the retriever so other functions can just call `.invoke(query)` to get relevant chunks.
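
Since the Chroma retriever uses MMR (maximal marginal relevance), a small sketch may help show what `k`, `fetch_k` and `lambda_mult` control. The code below is a plain-Python illustration of the general MMR idea on toy 2-D vectors, not the implementation used by LangChain or Chroma. The names `cosine` and `mmr_select` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, fetch_k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    # Step 1: shortlist the fetch_k documents most similar to the query.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:fetch_k]
    selected = []
    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.3]
docs = [
    [1.0, 0.0],    # close to the query
    [0.99, 0.05],  # near-duplicate of the first document
    [0.6, 0.8],    # less similar, but different content
]
balanced = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
print(balanced, relevance_only)
```

With `lambda_mult=1.0` the selection is pure similarity ranking and returns the two near-duplicates. With `lambda_mult=0.5` the second pick is the document with different content.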


In [None]:
from langchain_community.embeddings import OllamaEmbeddings   # FREE local embeddings
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone


In [None]:
from langchain_community.embeddings import OllamaEmbeddings

def get_embeddings():
    provider = os.getenv("EMBED_PROVIDER", "ollama").lower()
    model = os.getenv("EMBED_MODEL", "nomic-embed-text")

    print(f"[Embedding] Provider: {provider}")
    print(f"[Embedding] Model: {model}")

    if provider == "ollama":
        print("[Embedding] Using FREE local Ollama embeddings.")
        return OllamaEmbeddings(model=model)

    # Only used if provider=openai
    print("[Embedding] Using OpenAI embeddings (PAID).")
    from langchain_openai import OpenAIEmbeddings
    return OpenAIEmbeddings(model=model)




def get_retriever():
    print("======================================")
    print("      INITIALIZING RETRIEVER")
    print("======================================")

    backend = os.getenv("VECTOR_DB", "chroma").lower()
    print(f"[Retriever] Selected backend: {backend}")

    embeddings = get_embeddings()   # This will print too

    # ---------- CHROMA ----------
    if backend == "chroma":
        persist_dir = os.path.join(VECTOR_DIR, "chroma")
        os.makedirs(persist_dir, exist_ok=True)

        print(f"[Chroma] Persist directory: {persist_dir}")
        print("[Chroma] Loading or creating Chroma DB...")

        vectordb = Chroma(
            collection_name=os.getenv("COLLECTION", "docval"),
            embedding_function=embeddings,
            persist_directory=persist_dir,
        )

        print("[Chroma] Vector DB ready.")
        print("[Chroma] Returning MMR retriever.")

        return vectordb.as_retriever(
            search_type="mmr",
            search_kwargs={"k": 8, "fetch_k": 40, "lambda_mult": 0.5},
        )

    # ---------- FAISS ----------
    if backend == "faiss":
        faiss_dir = os.path.join(VECTOR_DIR, "faiss")
        os.makedirs(faiss_dir, exist_ok=True)

        print(f"[FAISS] Folder: {faiss_dir}")

        index_path = os.path.join(faiss_dir, "index.faiss")
        print(f"[FAISS] Index path: {index_path}")

        if os.path.exists(index_path):
            print("[FAISS] Found existing FAISS index. Loading...")
            vectordb = FAISS.load_local(
                faiss_dir,
                embeddings,
                allow_dangerous_deserialization=True,
            )
        else:
            print("[FAISS] No index found. Creating empty FAISS index...")
            vectordb = FAISS.from_texts([""], embeddings)
            vectordb.save_local(faiss_dir)
            print("[FAISS] Created empty FAISS index.")

        print("[FAISS] Returning retriever.")
        return vectordb.as_retriever(search_kwargs={"k": 8})

    # ---------- PINECONE ----------
    if backend == "pinecone":
        print("[Pinecone] Initializing Pinecone backend...")
        pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY", ""))

        index_name = os.getenv("PINECONE_INDEX", os.getenv("COLLECTION", "docval"))
        print(f"[Pinecone] Index name: {index_name}")

        dim = int(os.getenv("EMBED_DIM", "768"))
        print(f"[Pinecone] Embedding dimension: {dim}")

        existing = [idx["name"] for idx in pc.list_indexes()]
        print(f"[Pinecone] Existing indexes: {existing}")

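        if index_name not in existing:
            print("[Pinecone] Index not found. Creating new index...")
            # Pinecone client v3+ requires a spec. The serverless cloud/region
            # below are assumed defaults read from hypothetical env variables.
            from pinecone import ServerlessSpec
            pc.create_index(
                name=index_name,
                dimension=dim,
                metric="cosine",
                spec=ServerlessSpec(
                    cloud=os.getenv("PINECONE_CLOUD", "aws"),
                    region=os.getenv("PINECONE_REGION", "us-east-1"),
                ),
            )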

        print("[Pinecone] Connecting to existing index...")
        vectordb = PineconeVectorStore.from_existing_index(
            index_name=index_name,
            embedding=embeddings,
        )

        print("[Pinecone] Returning retriever.")
        return vectordb.as_retriever(search_kwargs={"k": 8})

    # ---------- INVALID ----------
    print(f"[ERROR] Unsupported VECTOR_DB backend: {backend}")
    raise ValueError(f"Unsupported VECTOR_DB: {backend}")


In [None]:
print("---- TESTING EMBEDDINGS ----")
emb = get_embeddings()
print("Embedding object:", emb)

print("\n---- TESTING RETRIEVER ----")
retriever = get_retriever()
print("Retriever object:", retriever)


---- TESTING EMBEDDINGS ----
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
Embedding object: base_url='http://localhost:11434' model='nomic-embed-text' embed_instruction='passage: ' query_instruction='query: ' mirostat=None mirostat_eta=None mirostat_tau=None num_ctx=None num_gpu=None num_thread=None repeat_last_n=None repeat_penalty=None temperature=None stop=None tfs_z=None top_k=None top_p=None show_progress=False headers=None model_kwargs=None

---- TESTING RETRIEVER ----
      INITIALIZING RETRIEVER
[Retriever] Selected backend: chroma
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
[Chroma] Persist directory: /Users/mona/test project of rag/test of rag/vector/chroma
[Chroma] Loading or creating Chroma DB...
[Chroma] Vector DB ready.
[Chroma] Returning MMR retriever.
Retriever object: tags=['Chroma', 'OllamaEmbeddings'] vectorstore=<langchain_chroma.v



## 4. File loading and chunking helpers

These helper functions standardize how I read and preprocess documents.

- `_file_type(path)`: looks at the file extension and returns a simple type label:
  - `.docx` → `"docx"`
  - `.pdf` → `"pdf"`
  - `.pptx` → `"pptx"`
  - anything else → `"other"` (ignored)

- `_load_text(abs_path, ftype)`: given an absolute path and a file type:
  - For `docx`, uses `Docx2txtLoader` to extract text.
  - For `pdf`, uses `PyPDFLoader` to read the pages (here I join them into one string).
  - For `pptx`, uses `UnstructuredPowerPointLoader` and concatenates slide texts.

- `_chunk_text(text)`: splits a long text into overlapping chunks using
  `RecursiveCharacterTextSplitter` with:
  - `chunk_size = 1000` characters
  - `chunk_overlap = 200` characters
  - splitting on paragraph/line/sentence boundaries when possible.

The goal is to get a list of `Document` chunks ready for embeddings and retrieval.
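
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut at the listed separators, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
```

The last 200 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.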


In [None]:
def _file_type(path: str) -> str:
    ext = os.path.splitext(path)[1].lower()
    if ext == ".docx":
        return "docx"
    if ext == ".pdf":
        return "pdf"
    if ext == ".pptx":
        return "pptx"
    return "other"


def _load_text(abs_path: str, ftype: str) -> str:
    if ftype == "docx":
        docs = Docx2txtLoader(abs_path).load()
        return docs[0].page_content
    if ftype == "pdf":
        # PyPDFLoader returns one Document per page, so join all pages.
        docs = PyPDFLoader(abs_path).load()
        return "\n\n".join(d.page_content or "" for d in docs)
    if ftype == "pptx":
        slides = UnstructuredPowerPointLoader(abs_path).load()
        return "\n\n".join(d.page_content or "" for d in slides)
    raise ValueError(f"Unsupported file type: {ftype}")


def _chunk_text(text: str):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " "],
    )
    return splitter.split_documents([Document(page_content=text)])

print("✔ File helper functions LOADED.")


✔ File helper functions LOADED.


## 5. Ingestion pipeline (`ingest`)

The `ingest()` function scans the `data/` directory, loads all supported documents,
splits them into chunks, and stores them in the chosen vector database.

Main steps:
1. Ensure `DATA_DIR` exists.
2. Create an embedding model via `get_embeddings()` (uses `EMBED_PROVIDER` and `EMBED_MODEL`).
3. Read `VECTOR_DB` to decide backend (`chroma`, `faiss`, or `pinecone`).
4. Walk through all files in `data/`.
   - Skip temporary files (like `~$...` from MS Office).
   - Detect type with `_file_type`.
   - Use `_load_text` to read content.
   - Use `_chunk_text` to split into smaller `Document` chunks.
   - Attach metadata (source path, filename, type, ingestion timestamp).
5. Insert all chunks into the selected vector store.
6. Return a small summary: backend used and number of chunks ingested.

This is the step that turns raw PDFs/DOCX/PPTX into searchable vector representations.
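
The file-discovery part of step 4 can be sketched on its own. One caveat worth illustrating: a plain `os.walk` also descends into hidden folders such as `.ipynb_checkpoints`, so notebook checkpoint copies can be indexed alongside the originals. The sketch below is one possible way to filter them out. `iter_supported_files` and `SUPPORTED_EXTENSIONS` are hypothetical names, and skipping hidden folders is an assumption about the desired behavior.

```python
import os
import tempfile

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def iter_supported_files(data_dir: str):
    """Yield paths of ingestible files, skipping hidden folders and Office lock files."""
    for root, dirs, files in os.walk(data_dir):
        # Pruning dirs in place stops os.walk from descending into hidden folders.
        dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        for name in sorted(files):
            if name.startswith("~$"):
                continue
            if os.path.splitext(name)[1].lower() in SUPPORTED_EXTENSIONS:
                yield os.path.join(root, name)


# Demonstration on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, ".ipynb_checkpoints"))
    for rel in [
        "policy.pdf",
        "~$policy.docx",
        "README.md",
        os.path.join(".ipynb_checkpoints", "policy-checkpoint.pdf"),
    ]:
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in iter_supported_files(tmp)]

print(found)
```

Only `policy.pdf` is returned: the lock file, the unsupported `README.md` and the checkpoint copy are all skipped.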


In [None]:
from datetime import datetime

def ingest():
    print("======================================")
    print("             INGEST START")
    print("======================================")

    os.makedirs(DATA_DIR, exist_ok=True)

    backend = os.getenv("VECTOR_DB", "chroma").lower()
    collection = os.getenv("COLLECTION", "docval")

    print(f"[Ingest] Backend: {backend}")
    print(f"[Ingest] Collection: {collection}")
    print(f"[Ingest] Data folder: {DATA_DIR}")

    embeddings = get_embeddings()   # <-- FREE embeddings
    all_docs = []

    # -------------------------
    # LOAD FILES FROM /data
    # -------------------------
    for root, _, files in os.walk(DATA_DIR):
        for name in files:
            if name.startswith("~$"):
                continue

            ftype = _file_type(name)
            if ftype == "other":
                print(f"[Skip] Unsupported file: {name}")
                continue

            path = os.path.join(root, name)
            print(f"[Load] {path} (type={ftype})")

            text = _load_text(path, ftype)
            print(f"[Text] Loaded {len(text)} characters.")

            chunks = _chunk_text(text)
            print(f"[Chunks] Created {len(chunks)} chunks.")

            for d in chunks:
                meta = {
                    "source_path": path.replace("\\", "/"),
                    "filename": name,
                    "type": ftype,
                    "ingested_at": datetime.utcnow().isoformat() + "Z",
                }
                all_docs.append(
                    Document(page_content=d.page_content, metadata=meta)
                )

    if not all_docs:
        print("[Ingest] No documents found in data/.")
        print("======================================")
        print("             INGEST END")
        print("======================================")
        return {"backend": backend, "count": 0}

    # -------------------------
    # WRITE TO VECTOR STORE
    # -------------------------
    print(f"[Ingest] Total chunks to index: {len(all_docs)}")

    if backend == "chroma":
        persist_dir = os.path.join(VECTOR_DIR, "chroma")
        os.makedirs(persist_dir, exist_ok=True)

        print(f"[Chroma] Persist dir: {persist_dir}")
        vectordb = Chroma(
            collection_name=collection,
            embedding_function=embeddings,
            persist_directory=persist_dir,
        )
        vectordb.add_documents(all_docs)
        print("[Chroma] Documents added.")

    elif backend == "faiss":
        faiss_dir = os.path.join(VECTOR_DIR, "faiss")
        os.makedirs(faiss_dir, exist_ok=True)

        print(f"[FAISS] Folder: {faiss_dir}")

        texts = [d.page_content for d in all_docs]
        metas = [d.metadata for d in all_docs]

        vectordb = FAISS.from_texts(texts, embeddings, metadatas=metas)
        vectordb.save_local(faiss_dir)
        print("[FAISS] Index saved.")

    elif backend == "pinecone":
        pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY", ""))
        index_name = os.getenv("PINECONE_INDEX", collection)
        print(f"[Pinecone] Index name: {index_name}")

        dim = int(os.getenv("EMBED_DIM", "768"))
        print(f"[Pinecone] Embedding dimension: {dim}")

        existing = [idx["name"] for idx in pc.list_indexes()]
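        if index_name not in existing:
            print("[Pinecone] Creating new index...")
            # Client v3+ requires a spec; cloud/region here are assumed defaults.
            from pinecone import ServerlessSpec
            pc.create_index(name=index_name, dimension=dim, metric="cosine",
                            spec=ServerlessSpec(cloud="aws", region="us-east-1"))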

        PineconeVectorStore.from_documents(
            documents=all_docs,
            embedding=embeddings,
            index_name=index_name,
        )
        print("[Pinecone] Documents saved.")

    print("======================================")
    print("             INGEST END")
    print("======================================")
    return {"backend": backend, "count": len(all_docs)}


In [None]:
result = ingest()
result


Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 30 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)


             INGEST START
[Ingest] Backend: chroma
[Ingest] Collection: docval
[Ingest] Data folder: /Users/mona/test project of rag/test of rag/data
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
[Load] /Users/mona/test project of rag/test of rag/data/ACME_Information_Security_Policy.pdf (type=pdf)

Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 30 0 (offset 0)
Ignoring wrong pointing object 31 0 (offset 0)


[Text] Loaded 1364 characters.
[Chunks] Created 2 chunks.
[Skip] Unsupported file: README.md
[Load] /Users/mona/test project of rag/test of rag/data/.ipynb_checkpoints/ACME_Information_Security_Policy-checkpoint.pdf (type=pdf)
[Text] Loaded 1364 characters.
[Chunks] Created 2 chunks.
[Ingest] Total chunks to index: 4
[Chroma] Persist dir: /Users/mona/test project of rag/test of rag/vector/chroma
[Chroma] Documents added.
             INGEST END


{'backend': 'chroma', 'count': 4}

## 6. RAG chat (`chat`)

Here I implement the main RAG question-answer function.

- `_concat(docs)`: joins the `page_content` of retrieved documents into a single `context` string.
- `chat(question)`: 
  1. Gets a `retriever` and an LLM via `get_retriever()` and `get_chat_model()`.
  2. Calls the retriever with the question to get relevant chunks.
  3. Concatenates them into a `context` string.
  4. Builds a prompt that:
     - Instructs the model to answer **only using the provided context**.
     - Says to admit "I don't know" if the context is insufficient.
  5. Sends question + context to the LLM.
  6. Returns:
     - `answer`: the model’s answer.
     - `context`: the actual context used.
     - `docs`: basic metadata (path, filename) of retrieved documents.

This is the core RAG QA call used in the API and CLI.
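
The flow described above can be sketched with stand-ins for the retriever and the model, which makes the data flow visible without Ollama or a vector store. `rag_answer`, `fake_retrieve` and `fake_generate` are hypothetical names, and the sample chunks are made up. Two details are variations, not what the notebook's `chat()` does: the sketch refuses early when nothing is retrieved, and it lists each source file only once.

```python
def rag_answer(question, retrieve, generate):
    """Retrieve chunks, join them into one context string, and ask the model."""
    chunks = retrieve(question)
    context = "\n\n".join(c["text"] for c in chunks)
    if not context:
        # Nothing retrieved: skip the model call and refuse.
        return {"answer": "I don't know", "context": "", "docs": []}
    answer = generate(question=question, context=context)
    # Report each source file once, in first-seen order.
    sources = list(dict.fromkeys(c["filename"] for c in chunks))
    return {"answer": answer, "context": context, "docs": sources}


def fake_retrieve(question):
    # Stand-in for a vector store retriever.
    return [
        {"text": "Section 1: Access control.", "filename": "policy.pdf"},
        {"text": "Section 2: Incident response.", "filename": "policy.pdf"},
    ]


def fake_generate(question, context):
    # Stand-in for an LLM call.
    return f"Answered from {len(context)} characters of context."


result = rag_answer("What does the policy cover?", fake_retrieve, fake_generate)
print(result["answer"])
print(result["docs"])
```

Injecting the retriever and generator as functions keeps the orchestration testable independently of the chosen provider.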


In [None]:
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate

def _concat(docs):
    return "\n\n".join(d.page_content for d in docs)


def chat(question: str):
    print("\n======================================")
    print("               CHAT START")
    print("======================================")

    retriever = get_retriever()
    llm = get_chat_model()

    print("[Chat] Retrieving documents...")
    # Retrievers are Runnables; .invoke() replaces the deprecated
    # get_relevant_documents() call.
    docs = retriever.invoke(question)
    print(f"[Chat] Retrieved {len(docs)} docs.")

    context = _concat(docs)

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a strict document assistant. Answer ONLY using this context. "
                "If answer is not in context, say: 'I don't know'."
            ),
            ("system", "Context:\n{context}"),
            ("human", "{question}"),
        ]
    )

    chain = prompt | llm

    print("[Chat] Calling LLM...")
    ai = chain.invoke(
        {
            "context": context,
            "question": question,
        }
    )

    print("[Chat] DONE.")
    print("======================================")

    return {
        "answer": ai.content,
        "context": context,
        "docs": [
            {
                "source_path": d.metadata.get("source_path"),
                "filename": d.metadata.get("filename"),
            }
            for d in docs
        ],
    }


In [None]:
resp = chat("What policies does the document define?")
print("\nANSWER:\n", resp["answer"])
print("\nSOURCES:\n", resp["docs"])



               CHAT START
      INITIALIZING RETRIEVER
[Retriever] Selected backend: chroma
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
[Chroma] Persist directory: /Users/mona/test project of rag/test of rag/vector/chroma
[Chroma] Loading or creating Chroma DB...
[Chroma] Vector DB ready.
[Chroma] Returning MMR retriever.
[Chat] Retrieving documents...


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Chat] Retrieved 4 docs.
[Chat] Calling LLM...
[Chat] DONE.

ANSWER:
 The document defines an "Information Security Policy".

SOURCES:
 [{'source_path': '/Users/mona/test project of rag/test of rag/data/ACME_Information_Security_Policy.pdf', 'filename': 'ACME_Information_Security_Policy.pdf'}, {'source_path': '/Users/mona/test project of rag/test of rag/data/.ipynb_checkpoints/ACME_Information_Security_Policy-checkpoint.pdf', 'filename': 'ACME_Information_Security_Policy-checkpoint.pdf'}, {'source_path': '/Users/mona/test project of rag/test of rag/data/ACME_Information_Security_Policy.pdf', 'filename': 'ACME_Information_Security_Policy.pdf'}, {'source_path': '/Users/mona/test project of rag/test of rag/data/.ipynb_checkpoints/ACME_Information_Security_Policy-checkpoint.pdf', 'filename': 'ACME_Information_Security_Policy-checkpoint.pdf'}]


## 7. Answer verification (`verify`)

This function checks if a generated answer is actually supported by the retrieved context.

Steps:
1. Build a prompt that shows:
   - Question
   - Answer
   - Context
2. Ask the LLM to classify the answer as one of:
   - `supported`
   - `partially_supported`
   - `unsupported`
3. Map this verdict to a numeric `confidence`:
   - 1.0 → supported
   - 0.5 → partially_supported
   - 0.0 → unsupported

The goal is to detect hallucinations and give the user a confidence score.
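
Model output is rarely as clean as a single lowercase word, so the verdict-to-score mapping benefits from some normalization. The sketch below shows one way to do it. `parse_verdict` and `VERDICT_SCORES` are hypothetical names, and treating unrecognized output as `unsupported` is an assumed conservative default.

```python
import re

VERDICT_SCORES = {"supported": 1.0, "partially_supported": 0.5, "unsupported": 0.0}


def parse_verdict(raw: str):
    """Map free-form verifier output to (verdict, confidence)."""
    # Keep only letters, underscores and spaces, then compare whole tokens
    # so that "unsupported" is never mistaken for "supported".
    tokens = re.sub(r"[^a-z_ ]", " ", raw.lower()).split()
    # If several labels appear, prefer the more cautious one.
    for verdict in ("unsupported", "partially_supported", "supported"):
        if verdict in tokens:
            return verdict, VERDICT_SCORES[verdict]
    # Unrecognized output is treated as unsupported.
    return "unsupported", 0.0


for sample in ["Supported.", "- partially_supported\n", "Verdict: unsupported", "maybe"]:
    print(repr(sample), "->", parse_verdict(sample))
```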


In [None]:
from langchain_core.prompts import ChatPromptTemplate

def verify(question: str, answer: str, context: str):
    print("\n======================================")
    print("             VERIFY START")
    print("======================================")

    llm = get_chat_model()

    vprompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a strict verifier. Decide if the ANSWER is supported by the CONTEXT. "
                "Return ONLY one word:\n"
                "- supported\n"
                "- partially_supported\n"
                "- unsupported"
            ),
            (
                "human",
                "QUESTION:\n{q}\n\nANSWER:\n{a}\n\nCONTEXT:\n{c}\n\nReturn the verdict:"
            ),
        ]
    )

    chain = vprompt | llm

    print("[Verify] Calling verifier model...")
    raw = chain.invoke({"q": question, "a": answer, "c": context}).content.strip().lower()
    print(f"[Verify] Raw model output: '{raw}'")

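    # Normalize. Test the most specific labels first, because "supported"
    # is a substring of the other two verdicts.
    if "unsupported" in raw:
        verdict = "unsupported"
        score = 0.0
    elif "partially" in raw:
        verdict = "partially_supported"
        score = 0.5
    elif "supported" in raw:
        verdict = "supported"
        score = 1.0
    else:
        # Unrecognized output is treated conservatively.
        verdict = "unsupported"
        score = 0.0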

    print("[Verify] Verdict:", verdict)
    print("[Verify] Confidence:", score)
    print("======================================")

    return {"verdict": verdict, "confidence": score}


In [None]:
resp = chat("Does the policy include incident response?")
answer = resp["answer"]
context = resp["context"]

verify("Does the policy include incident response?", answer, context)



               CHAT START
      INITIALIZING RETRIEVER
[Retriever] Selected backend: chroma
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
[Chroma] Persist directory: /Users/mona/test project of rag/test of rag/vector/chroma
[Chroma] Loading or creating Chroma DB...
[Chroma] Vector DB ready.
[Chroma] Returning MMR retriever.
[Chat] Retrieving documents...


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Chat] Retrieved 4 docs.
[Chat] Calling LLM...
[Chat] DONE.

             VERIFY START
[Verify] Calling verifier model...
[Verify] Raw model output: 'supported'
[Verify] Verdict: supported
[Verify] Confidence: 1.0


{'verdict': 'supported', 'confidence': 1.0}

## 8. Requirement validation (`validate`)

This function implements the "document validation" logic used for compliance (e.g. ISO 27001).

For each requirement string:
1. Use the retriever to fetch relevant chunks from the document.
2. Build a context string from those chunks.
3. Ask the LLM: 
   - Is this requirement present in the document, based on the context?
   - Answer "yes" or "no" and give a short justification.
4. Interpret the answer:
   - If it starts with "yes" → `present = True`
   - Otherwise → `present = False`
5. Return a list of results with:
   - `requirement`
   - `present` (bool)
   - `raw` (full LLM justification)
   - `docs` (metadata of supporting chunks)

This is the core feature that matches the thesis: automatic content / requirement checking.
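
Step 4 depends on the model following the `yes: <reason>` / `no: <reason>` format. A prefix check such as `startswith("yes")` would also accept a reply beginning with "Yesterday", so a slightly stricter parser may be useful. The sketch below is one possible approach. `parse_requirement_answer` is a hypothetical helper, and returning `None` for off-format replies is an assumption that differs from the boolean used in the notebook.

```python
def parse_requirement_answer(raw: str):
    """Split a 'yes: reason' / 'no: reason' reply into (present, reason)."""
    label, _, reason = raw.strip().partition(":")
    label = label.strip().strip("'\"").lower()
    if label == "yes":
        return True, reason.strip()
    if label == "no":
        return False, reason.strip()
    # Replies that ignore the format are left undecided for manual review.
    return None, raw.strip()


print(parse_requirement_answer("yes: The policy defines roles."))
print(parse_requirement_answer("No: VPN usage is not mentioned."))
print(parse_requirement_answer("Yesterday the policy was updated."))
```

Keeping a third, undecided state makes off-format answers visible in a compliance report instead of silently counting them as "no".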


In [None]:
from langchain_core.prompts import ChatPromptTemplate

def validate(requirements):
    print("\n======================================")
    print("           VALIDATION START")
    print("======================================")

    retriever = get_retriever()
    llm = get_chat_model()

    classifier = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a compliance checker. "
                "Decide if the REQUIREMENT is present in the document, "
                "based ONLY on the given CONTEXT.\n"
                "Answer with exactly:\n"
                "  'yes: <short reason>'  or\n"
                "  'no: <short reason>'"
            ),
            (
                "human",
                "Requirement:\n{req}\n\nContext:\n{ctx}\n\nAnswer:"
            ),
        ]
    )

    chain = classifier | llm
    results = []

    for req in requirements:
        print("\n--------------------------------------")
        print(f"[Validate] Requirement: {req}")

        # Retrievers are Runnables; .invoke() replaces the deprecated
        # get_relevant_documents() call.
        docs = retriever.invoke(req)
        print(f"[Validate] Retrieved {len(docs)} docs")

        ctx = _concat(docs)

        ans = chain.invoke({"req": req, "ctx": ctx}).content.strip()
        present = ans.lower().startswith("yes")

        print(f"[Validate] Raw answer: {ans}")
        print(f"[Validate] Present? {present}")

        results.append(
            {
                "requirement": req,
                "present": present,
                "raw": ans,
                "docs": [
                    {
                        "source_path": d.metadata.get("source_path"),
                        "filename": d.metadata.get("filename"),
                    }
                    for d in docs
                ],
            }
        )

    print("\n======================================")
    print("           VALIDATION END")
    print("======================================")

    return results


In [None]:
reqs = [
    "The policy must include an incident response process.",
    "The policy must define roles and responsibilities.",
    "The policy must describe backup and recovery.",
    "The policy must define VPN usage rules.",
]

results = validate(reqs)

for r in results:
    print("\nREQ:", r["requirement"])
    print("PRESENT:", r["present"])
    print("RAW:", r["raw"])
    print("FILES:", {d["filename"] for d in r["docs"]})



           VALIDATION START
      INITIALIZING RETRIEVER
[Retriever] Selected backend: chroma
[Embedding] Provider: ollama
[Embedding] Model: nomic-embed-text
[Embedding] Using FREE local Ollama embeddings.
[Chroma] Persist directory: /Users/mona/test project of rag/test of rag/vector/chroma
[Chroma] Loading or creating Chroma DB...
[Chroma] Vector DB ready.
[Chroma] Returning MMR retriever.

--------------------------------------
[Validate] Requirement: The policy must include an incident response process.


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Validate] Retrieved 4 docs
[Validate] Raw answer: yes: The requirement is present in the document as it mentions "Coordinates incident response activities" under 3.1 CISO's responsibilities.
[Validate] Present? True

--------------------------------------
[Validate] Requirement: The policy must define roles and responsibilities.


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Validate] Retrieved 4 docs
[Validate] Raw answer: yes: The policy defines roles and responsibilities for the CISO, IT Security Team, and Department Managers.
[Validate] Present? True

--------------------------------------
[Validate] Requirement: The policy must describe backup and recovery.


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Validate] Retrieved 4 docs
[Validate] Raw answer: yes: The document mentions "all stages of the information lifecycle" which includes backup and recovery.
[Validate] Present? True

--------------------------------------
[Validate] Requirement: The policy must define VPN usage rules.


Number of requested results 40 is greater than number of elements in index 4, updating n_results = 4


[Validate] Retrieved 4 docs
[Validate] Raw answer: yes: The requirement is present in the document as it defines VPN usage rules under the scope section, which states "All informa0on assets, including on-premise systems, cloud environments, laptops, mobile devices, and SaaS applica0ons."
[Validate] Present? True

           VALIDATION END

REQ: The policy must include an incident response process.
PRESENT: True
RAW: yes: The requirement is present in the document as it mentions "Coordinates incident response activities" under 3.1 CISO's responsibilities.
FILES: {'ACME_Information_Security_Policy.pdf', 'ACME_Information_Security_Policy-checkpoint.pdf'}

REQ: The policy must define roles and responsibilities.
PRESENT: True
RAW: yes: The policy defines roles and responsibilities for the CISO, IT Security Team, and Department Managers.
FILES: {'ACME_Information_Security_Policy.pdf', 'ACME_Information_Security_Policy-checkpoint.pdf'}

REQ: The policy must describe backup and recovery.
PRESE
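
The per-requirement output above can be condensed into a short report. The following is a minimal sketch, not part of the framework: `summarize` and `is_checkpoint_copy` are hypothetical helpers, and treating `-checkpoint` file names as Jupyter autosave copies of the same document is an assumption based on the `FILES` sets printed above. The sample data is illustrative and only mirrors the shape of the dictionaries returned by `validate()`.

```python
from pathlib import PurePath


def is_checkpoint_copy(filename):
    """True for names like 'report-checkpoint.pdf' (assumed autosave copies)."""
    return PurePath(filename).stem.endswith("-checkpoint")


def summarize(results):
    """Condense validate()-style results into counts, gaps, and source files."""
    missing = [r["requirement"] for r in results if not r["present"]]
    sources = sorted(
        {
            d["filename"]
            for r in results
            for d in r["docs"]
            if d.get("filename") and not is_checkpoint_copy(d["filename"])
        }
    )
    return {
        "total": len(results),
        "present": len(results) - len(missing),
        "missing": missing,
        "sources": sources,
    }


# Illustrative data in the same shape as the validate() results.
sample = [
    {
        "requirement": "The policy must define roles and responsibilities.",
        "present": True,
        "raw": "yes: ...",
        "docs": [
            {"filename": "ACME_Information_Security_Policy.pdf"},
            {"filename": "ACME_Information_Security_Policy-checkpoint.pdf"},
        ],
    },
    {
        "requirement": "An illustrative requirement that was not found.",
        "present": False,
        "raw": "no: ...",
        "docs": [],
    },
]

summary = summarize(sample)
print(summary)
```

Hiding checkpoint copies in the report is only cosmetic; excluding them at ingestion time would also keep them out of the vector store.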

## 9. Flask API (`create_app`)

`create_app()` wraps the core logic in a small HTTP API with three endpoints:

- `POST /ingest`
  - Runs `ingest()` and returns backend + count.
- `POST /chat`
  - Expects JSON with `"question"`.
  - Runs `chat(question)`, then `verify(...)` if any context was retrieved; otherwise the verdict is `"unsupported"` with confidence `0.0`.
  - Returns answer, verdict, confidence, and docs.
- `POST /validate`
  - Expects JSON with `"requirements": [...]`.
  - Returns the result of `validate(requirements)` under the key `"results"`.

This makes the framework usable from a web UI or external services.
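
As a rough illustration of how an external service might call these endpoints, the sketch below builds JSON `POST` requests with the standard library only. The base URL `http://localhost:8000` and the helper names `build_request` and `post_json` are assumptions made for this example; adjust the URL to wherever the app is actually served.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the Flask app is served


def build_request(path, payload):
    """Build a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=120):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so it can be inspected safely.
req = build_request(
    "/validate",
    {"requirements": ["The policy must define VPN usage rules."]},
)
print(req.get_method(), req.full_url)
```

With the server running, `post_json("/validate", {"requirements": [...]})` would return the decoded response body. The generous timeout is a guess, since each requirement triggers a retrieval step and an LLM call.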


In [None]:
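from flask import Flask, request, jsonify

def create_app():
    """Build the Flask app exposing ingest, chat, and validate over HTTP."""
    app = Flask(__name__)

    @app.post("/ingest")
    def _ingest():
        # Run ingestion and return its summary as JSON.
        res = ingest()
        return jsonify(res)

    @app.post("/chat")
    def _chat():
        # silent=True returns None for a missing or malformed body, so the
        # `or {}` fallback applies instead of Flask aborting the request.
        data = request.get_json(force=True, silent=True) or {}
        q = data.get("question", "")
        if not q:
            return jsonify({"error": "missing 'question'"}), 400

        r = chat(q)
        # Verify only when retrieval returned context; without context the
        # answer has nothing to be grounded on.
        if r["context"]:
            v = verify(q, r["answer"], r["context"])
        else:
            v = {"verdict": "unsupported", "confidence": 0.0}

        return jsonify(
            {
                "answer": r["answer"],
                "verdict": v["verdict"],
                "confidence": v["confidence"],
                "docs": r["docs"],
            }
        )

    @app.post("/validate")
    def _validate():
        data = request.get_json(force=True, silent=True) or {}
        reqs = data.get("requirements", [])
        # Reject anything that is not a non-empty list before calling the LLM.
        if not isinstance(reqs, list) or not reqs:
            return jsonify({"error": "'requirements' must be a non-empty list"}), 400
        res = validate(reqs)
        return jsonify({"results": res})

    return app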


## 10. Command-line interface (`main`)

The `main()` function provides a simple CLI with four subcommands:

- `docval ingest`
  - Runs ingestion and prints a JSON summary.
- `docval chat "your question"`
  - Runs `chat()` + `verify()` and prints answer + verdict + confidence + docs.
- `docval validate "req1" "req2" ...`
  - Runs `validate()` on a list of requirements and prints the results.
- `docval serve --host 0.0.0.0 --port 8000`
  - Starts the Flask app.

This is useful for quick testing without a notebook or UI.
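
To see how these subcommands map to parsed arguments without running the pipeline, the sketch below builds an equivalent parser and feeds it explicit argument lists. `build_parser` is a hypothetical helper introduced only for this illustration.

```python
import argparse


def build_parser():
    """Parser with the same four subcommands as described above."""
    parser = argparse.ArgumentParser(prog="docval")
    sub = parser.add_subparsers(dest="cmd")
    sub.add_parser("ingest")
    sub.add_parser("chat").add_argument("question")
    sub.add_parser("validate").add_argument("requirements", nargs="+")
    p_serve = sub.add_parser("serve")
    p_serve.add_argument("--host", default="0.0.0.0")
    p_serve.add_argument("--port", type=int, default=8000)
    return parser


parser = build_parser()

# Passing an explicit list avoids reading sys.argv, which in a notebook
# kernel holds the kernel's own flags rather than docval arguments.
val_args = parser.parse_args(["validate", "req one", "req two"])
print(val_args.cmd, val_args.requirements)

serve_args = parser.parse_args(["serve", "--port", "9000"])
print(serve_args.cmd, serve_args.host, serve_args.port)
```

Each positional after `validate` becomes one entry in `requirements`, so multi-word requirements need to be quoted on a shell command line.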


In [None]:
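import argparse
import json


def main(argv=None):
    """Entry point for the `docval` CLI.

    With argv=None, argparse reads sys.argv[1:]. Pass an explicit list
    (e.g. ["ingest"]) when calling from a notebook, where sys.argv holds
    the kernel's own arguments.
    """
    parser = argparse.ArgumentParser(prog="docval")

    # The chosen subcommand name is stored in args.cmd.
    sub = parser.add_subparsers(dest="cmd")

    sub.add_parser("ingest")

    p_chat = sub.add_parser("chat")
    p_chat.add_argument("question")

    p_val = sub.add_parser("validate")
    p_val.add_argument("requirements", nargs="+")

    p_serve = sub.add_parser("serve")
    p_serve.add_argument("--host", default="0.0.0.0")
    p_serve.add_argument("--port", type=int, default=8000)

    args = parser.parse_args(argv)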

    if args.cmd == "ingest":
        res = ingest()
        print(json.dumps(res, indent=2))

    elif args.cmd == "chat":
        r = chat(args.question)
        if r["context"]:
            v = verify(args.question, r["answer"], r["context"])
        else:
            v = {"verdict": "unsupported", "confidence": 0.0}
        print(
            json.dumps(
                {
                    "answer": r["answer"],
                    "verdict": v["verdict"],
                    "confidence": v["confidence"],
                    "docs": r["docs"],
                },
                indent=2,
            )
        )

    elif args.cmd == "validate":
        res = validate(args.requirements)
        print(json.dumps({"results": res}, indent=2))

    elif args.cmd == "serve":
        app = create_app()
        app.run(host=args.host, port=args.port)

    else:
        parser.print_help()


if __name__ == "__main__":
    main()
