# 1. Description of the project

This project implements a retrieval-augmented generation (RAG) system and combines it with LettuceDetect to detect hallucinations in generated answers.

# 2. Setup

1. **Install these packages:**

In [14]:
%pip install -qq langchain langchain-unstructured langchain-chroma langchain-openai unstructured langchain-community unstructured[pdf] python-dotenv lettucedetect==0.1.7 gradio ipykernel pydantic
%pip list

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


Package                                  Version
---------------------------------------- -----------
accelerate                               1.9.0
aiofiles                                 24.1.0
aiohappyeyeballs                         2.6.1
aiohttp                                  3.12.15
aiosignal                                1.4.0
annotated-types                          0.7.0
antlr4-python3-runtime                   4.9.3
anyio                                    4.10.0
asttokens                                3.0.0
attrs                                    25.3.0
backoff                                  2.2.1
bcrypt                                   4.3.0
beautifulsoup4                           4.13.4
Brotli                                   1.1.0
build                                    1.3.0
cachetools                               5.5.2
certifi                                  2025.8.3
cffi                                     1.17.1
charset-normalizer                       3

2. **Import the necessary modules**

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_unstructured import UnstructuredLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_chroma.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_core.documents.base import Document
from langchain_core.prompts import ChatPromptTemplate
import os, io, sys
from dotenv import load_dotenv, find_dotenv
from lettucedetect.models.inference import HallucinationDetector
import gradio as gr
from pydantic import BaseModel, Field


3. **Deploy an Azure OpenAI LLM resource and embedding resource**

    Use the following link: https://ai.azure.com/
4. **Save the details to the .env file:**
    ```bash
    echo AZURE_OPENAI_API_KEY=\"your-api-key-here\" >> .env
    echo AZURE_OPENAI_API_VERSION=\"your-version-here\" >> .env
    echo AZURE_OPENAI_ENDPOINT=\"your-endpoint-here\" >> .env
    echo GPT_MODEL=\"your-llm-model-here\" >> .env
    echo EMBEDDINGS_MODEL_NAME=\"your-embeddings-model-here\" >> .env
    echo EMBEDDINGS_DEPLOYMENT=\"your-embeddings-deployment-here\" >> .env
    ```
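Before running the rest of the notebook, it can help to confirm that every variable is actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones written to the `.env` file above.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "GPT_MODEL",
    "EMBEDDINGS_MODEL_NAME",
    "EMBEDDINGS_DEPLOYMENT",
]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper: `env` defaults to the process environment, so call
    it after the .env file has been loaded.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report anything that still needs to be filled in.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.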

# 3. ChromaDB setup

## 3.1 The text splitter

The text splitter divides documents into manageable chunks to optimize downstream processing and retrieval in RAG workflows.

In [3]:

def text_splitter(data, debug = False):
    """ Split the documents into chunks."""
    if debug:
        print("--- In function text_splitter ---\n"
              f"Splitting {len(data)} documents into chunks...\n"
        )

    # Split the documents into chunks of at most 1000 characters
    # with an overlap of 50 characters. The splitter is named `splitter`
    # so it does not shadow this function's own name.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        length_function=len,
    )
    chunks = splitter.split_documents(data)
    return chunks
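
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size windows. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; the function name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text: str, chunk_size: int = 1000,
                   chunk_overlap: int = 50) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbours share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Small values make the overlap visible
for chunk in sliding_chunks("abcdefghijklmnopqrstuvwxy",
                            chunk_size=10, chunk_overlap=2):
    print(chunk)
```

The overlap reduces the chance that a sentence relevant to a query is split across two chunks with neither half being retrievable on its own.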

## 3.2 The document loader

The document loader reads and parses files from the corpus directory into structured document objects for downstream processing.

In [4]:
def load_documents(corpus_dir: str = "", debug = False):
    """Load documents from the specified corpus directory."""
    if debug:
        print("--- In function load_documents ---")
    loaded_docs = []

    # Load all documents from the specified corpus directory
    if corpus_dir:
        if debug:
            print(f"Loading documents from corpus directory: {corpus_dir}\n")
        for file in os.listdir(corpus_dir):
            if debug:
                print(f"Loading {file}...")
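            # os.path.join works whether or not corpus_dir ends with a separator
            file_path = os.path.join(corpus_dir, file)
            loader = UnstructuredLoader(file_path, mode = 'single')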
            loaded_docs.extend(loader.load())

    # Filter complex metadata from loaded documents
    if debug:
        print("Filtering complex metadata...\n")
    filtered_docs = filter_complex_metadata(loaded_docs)

    return filtered_docs
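
The loader above passes every entry of the directory to `UnstructuredLoader`, including subdirectories or stray files. As a hedged sketch, the paths could be collected and filtered first; the helper `list_corpus_files` is hypothetical, and the default extensions are an assumption based on the file types the interface accepts later in this notebook.

```python
from pathlib import Path


def list_corpus_files(corpus_dir: str,
                      extensions=(".pdf", ".txt")) -> list[Path]:
    """Return the regular files in `corpus_dir` with an accepted extension.

    Hypothetical helper: subdirectories and other file types are skipped,
    and the result is sorted so the loading order is reproducible.
    """
    root = Path(corpus_dir)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Each returned path could then be handed to the loader in place of the raw `os.listdir` entries.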

## 3.3 The embedding client

The embedding client initializes and manages Azure OpenAI embeddings for converting text into vector representations.

In [5]:
def embeddings(debug = False):
    """ Initialize and return an Azure OpenAI embeddings client. """
    load_dotenv(find_dotenv())
    model = os.getenv('EMBEDDINGS_MODEL_NAME')
    api_key = os.getenv('AZURE_OPENAI_API_KEY')
    api_version = os.getenv("AZURE_OPENAI_API_VERSION")
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    azure_deployment = os.getenv("EMBEDDINGS_DEPLOYMENT")

    # Validate required environment variables
    if not all([model, api_key, api_version, azure_endpoint, azure_deployment]):
        raise ValueError(
            "Missing environment variables.\n"
            "Please add all the required environment variables "
            "in the .env file:\n"
            "EMBEDDINGS_MODEL_NAME, AZURE_OPENAI_API_KEY, "
            "AZURE_OPENAI_API_VERSION, AZURE_OPENAI_ENDPOINT, "
            "EMBEDDINGS_DEPLOYMENT"
        )
    
    # Initialize and return an Azure OpenAI embeddings client
    if debug:
        print("--- In function embeddings ---\n"
              f"Initializing embeddings with model: {model}, "
              f"deployment: {azure_deployment}\n"
        )
    embeddings = AzureOpenAIEmbeddings(
        model = model,
        api_key = api_key,
        api_version = api_version,
        azure_endpoint = azure_endpoint,
        azure_deployment = azure_deployment,
    )
    return embeddings

## 3.4 The vector database

The vector database stores document embeddings for fast similarity search and retrieval. Built with Chroma, it enables efficient access to relevant document chunks in RAG workflows.

In [6]:
def create_database(document_list, debug = False):
    """
    Initialize a database from a given corpus of documents.
    """    
    # Create a Chroma vector store from the documents
    if debug:
        print("--- In function create_database ---\n"
              "Creating vector database "
              f"with {len(document_list)} documents...\n"
        )
        
    # Return the vector store
    return Chroma.from_documents(documents = document_list,
                                 embedding = embeddings(debug = debug),
                                 persist_directory = None,)
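
Conceptually, similarity search ranks stored vectors by how close they are to the query vector. The sketch below shows that idea in plain Python with cosine similarity and a brute-force scan. It is a toy illustration only: the function names and the two-dimensional vectors are hypothetical, and the distance metric and index structure Chroma actually uses depend on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=2):
    """Return the texts of the `k` entries most similar to `query_vec`.

    `indexed` is a list of (text, vector) pairs, standing in for the
    chunk embeddings held by the vector store.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings" (made up for the example)
index = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
print(top_k([1.0, 0.0], index, k=2))
```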

# 4. The Hallucination detector

In [7]:
def detect_hallucinations(context, question, answer, debug = False):
    if debug:
        print("--- In function detect_hallucinations ---\n"
              f"Predicting hallucination for question: {question}\n"
              f"with answer: {answer}\n"
        )

    # Initialize the hallucination detector with a transformer model
    detector = HallucinationDetector(
        method="transformer",
        model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
    )

    # Predict hallucination using the detector
    result = detector.predict(context = context,
                              question = question,
                              answer = answer,
                              output_format = "spans")
    return result
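
With `output_format = "spans"`, the rest of this notebook reads each detected span as a dictionary with `text` and `confidence` keys. The sketch below formats such a list for display and optionally drops low-confidence spans. The helper `format_spans`, the `min_confidence` threshold, and the second sample span are hypothetical additions, not part of LettuceDetect or of the original pipeline.

```python
def format_spans(spans, min_confidence=0.0):
    """Format detected spans as one line each, skipping low-confidence ones.

    `spans` is a list of dicts with at least 'text' and 'confidence' keys.
    """
    lines = [
        f"'{span['text']}' - Confidence = {span['confidence']:.3f}"
        for span in spans
        if span["confidence"] >= min_confidence
    ]
    return "\n".join(lines)


# Sample input: the first span mirrors the result of Test 1 below,
# the second is invented to show the threshold at work.
sample_spans = [
    {"text": "The Audi A4 2024", "confidence": 0.9719486832618713},
    {"text": "analog gauges", "confidence": 0.40},
]
print(format_spans(sample_spans, min_confidence=0.5))
```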

# 5. The interface

In this section a GUI is created for the tests.

## 5.1 The Corpus class

In [8]:
class Corpus:
    """
    A class to handle corpus creation and file uploads.
    """

    def __init__(self, corpus_dir: str = "./corpus/", keep_files = False,
                 debug = False):
        """
        Initialize the Corpus with a directory path.

        If the directory exists, it will be emptied by default.
        """
        if debug:
            print("--- In Corpus constructor ---\n"
                  f"Creating corpus at {corpus_dir}...\n"
            )
        
        # Save the directory for later
        self.directory = corpus_dir

        # Unless specified, empty the specified corpus directory
        if not keep_files:
            if os.path.exists(self.directory):
                for f in os.listdir(self.directory):
                    os.remove(os.path.join(self.directory, f))
            os.makedirs(self.directory, exist_ok=True)

    def upload_files(self, uploaded_files, debug = False):
        """
        Save the uploaded files to the corpus directory
        """
        if debug:
            print("--- In function Corpus.upload_files ---\n"
                  f"Uploading {len(uploaded_files)} files "
                  "to corpus directory...\n"
            )

        # Save each uploaded file to the corpus directory (if any)
        for file in uploaded_files:
            filename = os.path.basename(file.name)
            target = os.path.join(self.directory, filename)
            # Context managers close both handles even if the copy fails
            with open(file.name, "rb") as src, open(target, "wb") as dst:
                dst.write(src.read())

## 5.2 The Retriever class

In [9]:
class Retriever:
    """
    A retriever class to handle document retrieval from a given corpus directory.
    """
    def __init__(self, corpus: Corpus, debug = False):
        """
        Initialize the retriever with a corpus directory
        """

        if debug:
            print("--- In Retriever constructor ---\n"
                  f"Creating retriever from {corpus.directory}...\n"
            )

        # Save the corpus directory
        self.corpus = corpus

        # Using instance variables so the destructor is not called
        self.docs = load_documents(self.corpus.directory, debug = debug)
        self.chunks = text_splitter(self.docs, debug = debug)
        self.vectordb = create_database(self.chunks, debug = debug)
        self.retriever = self.vectordb.as_retriever()

    def get_full_docs(self, doc_list: list[Document],
                      debug = False) -> list[Document]:
        """ Load full documents from the specified document list. """

        if debug:
            print("--- In function get_full_docs ---\n"
                f"Loading full documents from {len(doc_list)} filepaths...\n"
            )

        # Extract the sources from the document metadata
        sources = {doc.metadata.get('source')
            for doc in doc_list
            if 'source' in doc.metadata
        }

        # For each path, add the file content to the list of full documents
        full_docs = []
        for path in sources:
            with open(path, 'r', encoding='utf-8') as file:
                full_document = Document(page_content = file.read())
                full_docs.append(full_document)
            if debug:
                print(f"Loaded document: {path}")
        return full_docs
    
    def deep_search(self, answer: str, retrieved_docs: list[Document],
                    debug = False) -> list[Document]:
        """
        Retrieve additional chunks relevant to the answer, restricted to the
        source files of the documents already retrieved for the question.
        """

        if debug:
            print("--- In function Retriever.deep_search ---\n"
                  f"Retrieving documents relevant to the answer: {answer}\n"
                  "From the documents retrieved for the question...\n"
            )

        # Obtain the sources from the retrieved documents
        sources = {doc.metadata.get('source')
                   for doc in retrieved_docs
                   if 'source' in doc.metadata
        }

        # Obtain the relevant documents for the answer
        # invoke() replaces the deprecated get_relevant_documents()
        # on LangChain retrievers
        retrieved_docs_answer = self.retriever.invoke(answer)

        # Filter the retrieved documents to only include those
        # that are in the sources of the retrieved documents
        retrieved_docs_answer = [
            doc for doc in retrieved_docs_answer
            if doc.metadata.get('source') in sources
        ]

        # Join the retrieved documents with the original ones
        retrieved_docs.extend(retrieved_docs_answer)
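        if debug:
            # Build the list first: a multi-line expression inside a
            # single-line f-string is a syntax error before Python 3.12
            answer_sources = [doc.metadata.get('source', 'Unknown')
                              for doc in retrieved_docs_answer]
            print(f"Number of documents retrieved in deep search: "
                  f"{len(retrieved_docs_answer)}\n"
                  f"Retrieved documents: {answer_sources}\n"
            )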
        
        # Return the retrieved documents
        return retrieved_docs

    def get_relevant_documents(self, question: str = "", answer: str = "", 
                               full_docs=False, include_answer=False,
                               deep_search=False, debug=False) -> list[Document]:
        """
        Retrieve relevant documents from the corpus
        """

        # Retrieve documents relevant to the question
        if debug:
            print("--- In function Retriever.get_relevant_documents ---\n"
                  "Retrieving relevant documents for question:\n"
                  f"{question}\n"
            )
        # invoke() replaces the deprecated get_relevant_documents()
        # on LangChain retrievers
        retrieved_docs = self.retriever.invoke(question)
        if debug:
            question_sources = [doc.metadata.get('source', 'Unknown')
                                for doc in retrieved_docs]
            print("Retrieved relevant documents for question.\n"
                  f"Number of retrieved documents: {len(retrieved_docs)}\n"
                  f"Retrieved documents: {question_sources}\n"
            )
            
        # If deep search is enabled, retrieve documents relevant to the answer
        # in the documents retrieved for the question
        if deep_search:
            if not include_answer:
                raise ValueError(
                    "Deep search is enabled, but include_answer is False. "
                    "Please set include_answer to True to use deep search."
                )
            retrieved_docs = self.deep_search(
                answer = answer,
                retrieved_docs = retrieved_docs,
                debug = debug
            )
        
        # If deep search is disabled and include_answer is True,
        # retrieve documents relevant to the answer
        elif include_answer:
            retrieved_docs_answer = self.retriever.invoke(answer)
            retrieved_docs = retrieved_docs + retrieved_docs_answer

        # If full_docs mode is enabled, return the complete retrieved documents
        if full_docs:
            retrieved_docs = self.get_full_docs(retrieved_docs, debug = debug)

        # Print the retrieved documents if debug mode is enabled
        if debug:
            final_sources = [doc.metadata.get('source', 'Unknown')
                             for doc in retrieved_docs]
            print(f"Final number of documents: {len(retrieved_docs)}\n"
                  f"Retrieved documents: {final_sources}\n"
            )

        # Return the retrieved documents
        return retrieved_docs
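
The filtering step at the heart of `deep_search` can be tried in isolation. The sketch below uses a small stand-in `Doc` dataclass instead of LangChain's `Document`, and hypothetical file names; `merge_same_source` is an illustrative helper, not part of the class above. Unlike `deep_search`, it returns a new list rather than extending its input in place.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_same_source(question_docs, answer_docs):
    """Append to `question_docs` the answer-relevant docs from the same files.

    Only docs whose 'source' also appears among the question-relevant docs
    are kept, which is the restriction deep search applies.
    """
    sources = {
        doc.metadata["source"]
        for doc in question_docs
        if "source" in doc.metadata
    }
    kept = [doc for doc in answer_docs if doc.metadata.get("source") in sources]
    return question_docs + kept


# Hypothetical chunks and file names
question_docs = [Doc("12.3-inch digital display", {"source": "audi.txt"})]
answer_docs = [
    Doc("Audi A4 2024 overview", {"source": "audi.txt"}),
    Doc("Unrelated model overview", {"source": "other.txt"}),
]
merged = merge_same_source(question_docs, answer_docs)
print([doc.page_content for doc in merged])
```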

## 5.3 The LLM checker

In [10]:
class LLMVerifier:
    """
    A class that uses an LLM to verify detected hallucinations.
    """

    def __init__(self, retriever: Retriever, debug = False):
        """
        Initialize the LLMVerifier with a retriever instance.
        """
        if debug:
            print("--- In LLMVerifier constructor ---\n"
                  "Creating LLMVerifier with environment variables...\n"
            )
        self.debug = debug

        # Save the retriever instance
        self.retriever = retriever

        # Get environment variables
        load_dotenv(find_dotenv())
        api_version = os.getenv('AZURE_OPENAI_API_VERSION')
        azure_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
        api_key = os.getenv('AZURE_OPENAI_API_KEY')
        gpt_model = os.getenv('GPT_MODEL')
        
        # Validate required environment variables
        if not all([api_version, azure_endpoint, api_key, gpt_model]):
            raise ValueError(
                """
                Missing environment variables.
                Please load all the required environment variables in the .env file:
                AZURE_OPENAI_API_VERSION, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, GPT_MODEL
                """
            )
            
        # Initialize a LangChain Azure OpenAI client
        # and prepare it to return structured outputs
        self.client = AzureChatOpenAI(
            azure_deployment = gpt_model,
            api_version=api_version,
            azure_endpoint=azure_endpoint,
            api_key=api_key,
        )
        self.structured_llm = self.client.with_structured_output(self.OutputSchema)

        # Create the verification prompt template for the LLM
        self.prompt_template = ChatPromptTemplate.from_messages([
            (
                "system",
                (
                    "Your task is to evaluate statements generated by an LLM in order to determine if they are correct or not. "
                    "For that, you are going to be given a certain context which is true. "
                    "Later you will be given a question made by a human and an answer generated by an LLM. "
                    "Your task is to determine if a certain passage of the answer is true or false based on the context. "
                    "For that, first you will reason step by step on the context and the question, and derive a conclusion about the truthfulness of the statement. "
                    "Finally, you will justify the reason behind your decision. "
                    "Guidelines for the evaluation: "
                    "- First, identify the position of the hallucination in the answer. "
                    "- Second, identify the role of the hallucination in the context. "
                    "- Next, reason about the correctness of the hallucinated part given the question, answer and context. Does it answer the question in a logically correct way? "
                    "- If the content is correct given the context, return True. "
                    "- If the input is corrupted, empty or malformed, or if it consists of a single character, return True. "
                    "- If the content is incorrect, return False. For the content to be considered incorrect, it must be wrong or inconsistent based on the context. "
                    "- If the content is hallucinated, return False. For the content to be considered hallucinated, it must not be present in the context, or combine different parts of the context in a way that is not logically correct. "
                ),
            ),
            (
                "user",
                (
                    "Use the following context as true information: {context} "
                    "Given the following question: {question}, you will evaluate a part of the following ai-generated answer: {answer} "
                    "The part of the answer to evaluate is: {hallucination}."
                ),
            ),
        ])

    class OutputSchema(BaseModel):
        """
        Output schema for the LLMVerifier.
        """
        reasoning_question: str = Field(..., description = "Identify what is being asked in the question, and what does the user want to know.")
        reasoning_answer: str = Field(..., description = "Identify the position of the hallucination in the answer, as it's a part of it.")
        reasoning_context: str = Field(..., description = "Identify the relation of the hallucination with the context, and how it relates to the question and answer.")
        reasoning_answer_type_match: str = Field(..., description = "Determine whether the type of the answer matches the type of information requested in the question.")
        reasoning_correctness: str = Field(..., description = "Reason about the correctness of the hallucinated part given the question, answer and context. Does it answer the question in a logically correct way?")
        is_correct_first_thought: bool = Field(..., description = "Using the guidelines, decide if the hallucination is correct or not.")
        justification: str = Field(..., description = "Justification for the decision, based on the reasoning and guidelines.")
        is_correct: bool = Field(..., description = "Whether the hallucination is correct or not after reasoning, having a first thought and justifying the decision.")

    def is_correct(self, context: list[Document], question: str,
                   answer: str, hallucination: str):
        """
        Check whether a flagged passage is correct given a context, using the LLM.
        """

        # Collect the unique page contents of the context documents,
        # keeping their retrieval order so the prompt is deterministic
        context_str = list(dict.fromkeys(doc.page_content for doc in context))
        if self.debug:
            print("--- In function LLMVerifier.is_correct ---\n"
                  f"Verifying hallucination: {hallucination}\n"
                  f"Using context: {context_str}\n"
                  f"With question: {question}\n"
                  f"With answer: {answer}\n"
            )

        # Obtain the response from the LLM using the prompt template
        chain = self.prompt_template | self.structured_llm
        response = chain.invoke({
            "context": "\n".join(context_str),
            "answer": answer,
            "question": question,
            "hallucination": hallucination
        })

        if self.debug:
            print("LLM Response:")
            print(f"Reasoning on the question: {response.reasoning_question}")
            print(f"Reasoning on the answer: {response.reasoning_answer}")
            print(f"Reasoning on the context: {response.reasoning_context}")
            print(f"Reasoning on the answer type match: {response.reasoning_answer_type_match}")
            print(f"Reasoning on the correctness: {response.reasoning_correctness}")
            print(f"Is correct (first thought): {response.is_correct_first_thought}")
            print(f"Justification: {response.justification}")
            print(f"Is correct: {response.is_correct}")
        
        # Return whether the LLM judged the flagged passage to be correct
        return response.is_correct

    def verify_hallucinations(self, detected_hallucinations: list[dict],
                              question: str, answer: str) -> str:
        """
        Verify a list of detected hallucinations using the LLM.
        """
        if self.debug:
            print("--- In function LLMVerifier.verify_hallucinations ---\n"
                  f"Verifying {len(detected_hallucinations)} hallucinations "
                  f"for question: {question}\n"
                  f"And answer: {answer}\n"
            )

        # Retrieve the relevant documents for the question
        retrieved_docs_question = self.retriever.get_relevant_documents(question)
        retrieved_docs_answer = self.retriever.get_relevant_documents(answer)
        retrieved_docs = retrieved_docs_question + retrieved_docs_answer

        # Verify each hallucination using the LLM
        final_hallucinations = ""
        for hallucination in detected_hallucinations:
            # Verify each hallucination using the LLM
            is_correct = self.is_correct(
                context = retrieved_docs,
                question = question,
                answer = answer,
                hallucination = hallucination['text'],
            )

            # If the LLM judges the passage incorrect, keep it
            # as a confirmed hallucination in the final string
            if not is_correct:
                final_hallucinations += (
                    f"\'{hallucination['text']}\'"
                    f" - Confidence = {hallucination['confidence']}\n"
                )

        # Return the final hallucinations string
        return final_hallucinations
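
The filtering logic in `verify_hallucinations` does not depend on the LLM itself, so it can be exercised with a stub. In the sketch below, `substring_checker` is a hypothetical and deliberately naive stand-in for `LLMVerifier.is_correct`: it treats a passage as correct when it appears verbatim in the context. The sample context and the second span are invented for the example.

```python
from typing import Callable


def keep_unverified(spans: list[dict],
                    is_correct: Callable[[str], bool]) -> list[dict]:
    """Keep only the spans that the checker does NOT accept as correct."""
    return [span for span in spans if not is_correct(span["text"])]


def substring_checker(context: str) -> Callable[[str], bool]:
    """Build a naive checker: a passage is 'correct' if it occurs in `context`.

    Stand-in for the LLM call, for illustration only.
    """
    lowered = context.lower()

    def is_correct(passage: str) -> bool:
        return passage.strip().lower() in lowered

    return is_correct


context = "The Audi A4 2024 has a 12.3-inch digital instrument display."
spans = [
    {"text": "The Audi A4 2024", "confidence": 0.97},
    {"text": "a 15-inch display", "confidence": 0.80},
]
confirmed = keep_unverified(spans, substring_checker(context))
print([span["text"] for span in confirmed])
```

Separating the decision function from the loop in this way would also make it possible to swap in a different verifier without changing the filtering code.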

## 5.4 The backend function

In [11]:
def gradio_backend(uploaded_files = None, context = "", question = "", answer = "",
                   full_docs = False, include_answer = False,
                   deep_search = False, llm_verification = False, debug = False):
    """
    Gradio backend function to handle file uploads and hallucination detection.
    """

    # If debug mode is enabled, redirect the standard output to a string
    if debug:
        debug_output_stream = io.StringIO()
        original_stdout = sys.stdout
        sys.stdout = debug_output_stream
        
    # Initialize variables for the output
    print("--- In function gradio_backend ---")
    hallucination_was_found = "No hallucinations found"
    hallucination_str = ""
    error_output = gr.update(
        value = "",
        visible = False
    )
    llm_output = gr.update(
        value = "",
        visible = False
    )
    debug_output = gr.update(
        value = "",
        visible = False
    )

    try:
        # Initialize the corpus and save the uploaded files and context
        corpus = Corpus(corpus_dir = "./temp/", debug = debug)
        if uploaded_files:
            corpus.upload_files(uploaded_files, debug = debug)
        if context:
            with open("./temp/context.txt", "w") as f:
                f.write(context)
        if not uploaded_files and not context:
            raise ValueError(
                "No files or context provided. "
                "Please upload files or provide context."
            )
            

        # Initialize the retriever with the corpus and retrieve relevant documents
        retriever = Retriever(corpus = corpus, debug = debug)
        retrieved_docs = retriever.get_relevant_documents(
            question = question,
            answer = answer,
            full_docs = full_docs,
            include_answer = include_answer,
            deep_search = deep_search,
            debug = debug
        )

        # Detect hallucinations using the detect_hallucinations function
        detected_hallucinations = detect_hallucinations(
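            # The detector takes the context as a list of strings,
            # so pass the page contents rather than Document objects
            context = [doc.page_content for doc in retrieved_docs],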
            question = question,
            answer = answer,
            debug = debug
        )

        # Check if a hallucination was detected
        if detected_hallucinations:
            hallucination_was_found = "Hallucinations detected"

        # Create an output string based on the result
        for hallucination in detected_hallucinations:
            hallucination_str += (
                f"\'{hallucination['text']}\'"
                f" - Confidence = {hallucination['confidence']}\n"
            )

        # If LLM verification is enabled, verify the hallucinations using the LLM
        if llm_verification:
            # Initialize the LLM verifier
            llm_verifier = LLMVerifier(retriever = retriever, debug = debug)
            
            # Verify the hallucinations using the LLM
            final_hallucinations = llm_verifier.verify_hallucinations(
                detected_hallucinations = detected_hallucinations,
                question = question,
                answer = answer
            )
                    
            # Initialize the llm output with the verified hallucinations
            llm_output = gr.update(
                value = final_hallucinations,
                visible = True
            )

    except Exception as exc:
        error_output = gr.update(
            value = str(exc),
            visible = True
        )

    finally:
        if debug:
            # Restore the original standard output
            sys.stdout = original_stdout

            # Store the debug output
            debug_output = gr.update(
                value = debug_output_stream.getvalue(),
                visible = True
            )

    return hallucination_was_found, hallucination_str, error_output, llm_output, debug_output
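
The backend captures debug prints by swapping `sys.stdout` by hand and restoring it in the `finally` block. The standard library offers `contextlib.redirect_stdout` for the same purpose, which restores the stream automatically even when an exception is raised. A minimal sketch, where the wrapper `run_with_captured_output` and the function `noisy_double` are hypothetical:

```python
import contextlib
import io


def run_with_captured_output(func, *args, **kwargs):
    """Call `func` and return (result, everything it printed to stdout)."""
    buffer = io.StringIO()
    # stdout is restored when the block exits, even on an exception
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_double(x):
    """Example function that prints a debug line and returns a value."""
    print("debug", x)
    return x * 2


result, log = run_with_captured_output(noisy_double, 3)
print(repr(result), repr(log))
```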


## 5.5 The interface

In [12]:
rag_application = gr.Interface(
    fn = gradio_backend,
    allow_flagging = "never",
    inputs = [
        # Drag and drop files, returns a list of file paths
        gr.File(
            label = "Upload PDF/txt files",
            file_count = 'multiple',
            file_types = ['.pdf', '.txt']
        ),
        gr.Textbox(
            label = "Context",
            placeholder = "Provide additional context here..."
        ),
        gr.Textbox(
            label = "Prompt",
            placeholder = "Type your question here..."
        ),
        gr.Textbox(
            label = "Answer",
            lines = 3,
            placeholder = "Type the answer here..."
        ),
        gr.Checkbox(
            label = "Full documents mode",
            value = False
        ),
        gr.Checkbox(
            label = "Include answer in relevance search",
            value = False
        ),
        gr.Checkbox(
            label = "Deep search",
            value = False
        ),
        gr.Checkbox(
            label = "LLM verifier",
            value = False,
        ),
        gr.Checkbox(
            label = "Debug mode",
            value = False
        ),
    ],
    outputs = [
        gr.Textbox(label = "Status"),
        gr.Textbox(label = "Detected Hallucinations"),
        gr.Textbox(label = "Error Output", visible = False),
        gr.Textbox(label = "LLM Verified Hallucinations", visible = False),
        gr.Textbox(label = "Debug Output", visible = False),
    ],
    title = "RAG system with Hallucination Detection",
    description = "Upload a collection of PDF or txt files, then provide a prompt "
                  "and a response. The backend will try to detect "
                  "hallucinations in the response based on the context."
)

rag_application.launch()



* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




# 6. Tests

After completing the setup, several tests were carried out to check the performance of the system. Each one reflects a state of the project and justifies the adjustments made to the code to improve performance.

## Test 1

This test was run before the answer was included in the retrieval query, so only the question was used to find the relevant chunks.

### Inputs

**Question:** Which car model features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (correct):** The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges.

**Full docs:** False

**Include answer:** False

**Deep search:** False

**LLM verifier:** False

### Outputs

**Status:** Hallucinations detected

**Detected hallucinations:** 'The Audi A4 2024' - Confidence = 0.9719486832618713

We retrieve the relevant documents for the query in order to understand this result:

In [None]:
test_corpus = Corpus(corpus_dir = "./example_inputs/", debug = True, keep_files = True)
test_retriever = Retriever(test_corpus, debug = True)
docs = test_retriever.get_relevant_documents(question = """
    Which car model features a 12.3-inch high-resolution digital instrument
    display that replaces traditional analog gauges?
    """)
docs

--- In Corpus constructor ---
Creating corpus at ./example_inputs/...

--- In Retriever constructor ---
Creating retriever from ./example_inputs/...

--- In function load_documents ---
Loading documents from corpus directory: ./example_inputs/



FileNotFoundError: [Errno 2] No such file or directory: './example_inputs/'


The problem is that the 'Audi A4 2024' part of the answer does not appear in the chunks retrieved as similar to the question, even though the model name and the display description belong to the same document. In the following tests, we explore two approaches to solving this issue:

- Finding a way to provide additional context to the detector so that the missing parts are not flagged as hallucinations (Tests 2 to 5).

- Processing a text passage after it has been flagged, in order to find out whether it is truly a hallucination (Test 6).
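
The failure mode seen in Test 1 can be reproduced with a toy example. This is only a sketch: the real system ranks chunks with embeddings, whereas here a bag-of-words cosine similarity stands in for the embedding model, and the two chunk texts are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two hypothetical chunks taken from the same document.
chunks = [
    "The Audi A4 2024 is a sedan powered by a TFSI four-cylinder engine.",
    "The cabin features a 12.3-inch high-resolution digital instrument "
    "display that replaces traditional analog gauges.",
]
question = (
    "Which car model features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges?"
)

# Rank the chunks by similarity to the question and keep the best one.
scores = [cosine(bag_of_words(question), bag_of_words(c)) for c in chunks]
best = chunks[scores.index(max(scores))]

print(best)
print("Audi A4" in best)
```

The chunk that best matches the question describes the display but never names the car, so a detector that only sees this chunk has no evidence for the model name in the answer.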

## Test 2

The same test as before, but now the answer is included in the similarity search.

### Inputs

**Question:** Which car model features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (correct):** The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges.

**Full docs:** False

**Include answer:** True

**Deep search:** False

**LLM verifier:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

We retrieve the relevant documents for the question and the provided answer together. In this case, the retrieved chunks include the information about the "Audi A4 2024" model.

In [None]:
test_corpus = Corpus(corpus_dir = "./example_inputs/", debug = True, keep_files = True)
test_retriever = Retriever(test_corpus, debug = True)
# Query with the question and the answer joined by a space
docs = test_retriever.get_relevant_documents(
    "Which car model features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges? "
    "The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges."
)
docs

In the following code block, we run the retrieval for the question and for the answer separately and join the results, in order to give the system more context.

**Note:** In the output, 8 out of 10 documents are retrieved because their contents are so similar. Keep in mind that all the example inputs are AI-generated car descriptions. With a more diverse corpus, the share of documents retrieved is expected to be smaller.

In [None]:
test_corpus = Corpus(corpus_dir = "./example_inputs/", debug = True, keep_files = True)
test_retriever = Retriever(test_corpus, debug = True)
question = """
    Which car model features a 12.3-inch high-resolution digital 
    instrument display that replaces traditional analog gauges?
"""
answer = """
    The Audi A4 2024 features a 12.3-inch high-resolution digital
    instrument display that replaces traditional analog gauges.
"""
test_retriever.get_relevant_documents(question, answer, include_answer = True,
                                      debug = True)

### Conclusion

We chose to retrieve documents for the question and the answer separately and then combine them, so that both the context of the query and the specific entities or facts in the answer are represented.

This gives LettuceDetect a fuller context for verification: supporting evidence for true statements is more likely to be included, which reduces false positives (correct passages flagged as hallucinations).
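
As a rough illustration of the combination step, the sketch below merges two retrieval results while keeping their order and dropping duplicates. The function name `merge_retrievals` and the plain-string chunks are assumptions made for this example; the notebook's `Retriever` may combine results differently.

```python
from typing import Iterable, List


def merge_retrievals(question_chunks: Iterable[str],
                     answer_chunks: Iterable[str]) -> List[str]:
    """Combine two retrieval results, question chunks first, without duplicates."""
    seen = set()
    merged = []
    for chunk in [*question_chunks, *answer_chunks]:
        key = chunk.strip()
        # Skip empty chunks and chunks that were already added.
        if key and key not in seen:
            seen.add(key)
            merged.append(chunk)
    return merged


# Hypothetical retrieval results for the question and for the answer.
from_question = ["display chunk", "infotainment chunk"]
from_answer = ["model overview chunk", "display chunk"]

context = merge_retrievals(from_question, from_answer)
print(context)
```

Deduplicating matters because the question and the answer share most of their wording, so both retrievals tend to return overlapping chunks.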


## Test 3

In this case, we successfully trick the model with a wrong answer that recombines information actually present in the retrieved documents.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The TFSI four-cylinder engine features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** False

**Include answer:** True

**Deep search:** False

**LLM verifier:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

## Test 4

In this test, the functionality for loading the entire documents containing the relevant context is added.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The TFSI four-cylinder engine features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** True

**Include answer:** True

**Deep search:** False

**LLM verifier:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

### Conclusion 

This method seemed promising but failed: the incorrect answer still went undetected, because the BERT-based detector is not designed to handle inputs as long as full documents.
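
One possible guard against over-long inputs is to cap the context before it reaches the detector. The sketch below is an assumption, not part of the project: `fit_to_budget` is a hypothetical helper, the budget value is arbitrary, and whitespace word counts stand in for the detector's real tokenizer.

```python
from typing import Callable, List


def fit_to_budget(chunks: List[str], max_tokens: int,
                  count_tokens: Callable[[str], int] = lambda t: len(t.split())
                  ) -> List[str]:
    """Keep chunks in ranked order until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # the next chunk would not fit, so stop here
        kept.append(chunk)
        used += cost
    return kept


# Chunks are assumed to be sorted from most to least relevant.
ranked_chunks = ["a b c", "d e", "f g h i"]
print(fit_to_budget(ranked_chunks, max_tokens=5))
```

Because the chunks are kept in ranked order, the most relevant context survives when the budget is tight.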

## Test 5

In this test, we try the **deep search** strategy. First, we retrieve the context relevant to the question.
Next, we enrich this context with context related to the answer, taken only from the documents found in the first step.
This way, the information is extracted only from the documents relevant to the question, but the important information about the answer is not lost.
At the same time, context from documents unrelated to the question is kept out.
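
The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Chunk` class, the word-overlap score, and the `deep_search` function are hypothetical stand-ins for the project's `Retriever` and its embedding-based similarity search.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    source: str  # file the chunk comes from
    text: str


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared by query and text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)


def top_k(query: str, chunks: List[Chunk], k: int) -> List[Chunk]:
    """Return the k chunks with the highest toy relevance score."""
    return sorted(chunks, key=lambda c: overlap(query, c.text), reverse=True)[:k]


def deep_search(question: str, answer: str, chunks: List[Chunk], k: int = 1) -> List[Chunk]:
    # Step 1: find the chunks most relevant to the question.
    question_hits = top_k(question, chunks, k)
    sources = {c.source for c in question_hits}
    # Step 2: search for the answer only inside those source documents.
    candidates = [c for c in chunks
                  if c.source in sources and c not in question_hits]
    return question_hits + top_k(answer, candidates, k)


corpus = [
    Chunk("audi.txt", "The Audi A4 2024 has a TFSI four-cylinder engine."),
    Chunk("audi.txt", "It features a 12.3-inch digital instrument display."),
    Chunk("bmw.txt", "The BMW 3 Series 2024 has a turbocharged engine."),
]
result = deep_search(
    "Which car features a 12.3-inch digital instrument display?",
    "The Audi A4 2024 features a 12.3-inch digital instrument display.",
    corpus,
)
for chunk in result:
    print(chunk.source, "|", chunk.text)
```

In this toy corpus the BMW chunk never enters the context, because the second search is restricted to the file that matched the question.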

First, we check the retrieved context for the same question as before, this time using the deep search strategy:

In [None]:
test_corpus = Corpus(corpus_dir = "./example_inputs/", debug = True, keep_files = True)
test_retriever = Retriever(test_corpus, debug = True)

# Adjacent string literals are concatenated, so each first part ends with a space
question = (
    "Which car model features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges?"
)
answer = (
    "The Audi A4 2024 features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges."
)
          
test_retriever.get_relevant_documents(
    question = question, answer = answer, full_docs = False,
    include_answer = True, deep_search = True, debug = True
)

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The TFSI four-cylinder engine features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** False

**Include answer:** True

**Deep search:** True

**LLM verifier:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

### Conclusion

With the deep search strategy the problem persists: even though all the context comes from the files relevant to the question, the detector cannot tell when a piece of information is used inadequately in the answer, as long as that information appears somewhere in the context.

## Test 6

For this test, we develop a feature that uses an LLM client to check the text passages flagged as hallucinations. The strategy is as follows:

- We noticed that adding context similar to the answer increases the number of false negatives. Thus, we will not use the **include answer** and **deep search** settings. The **full docs** setting will also be deactivated so that BERT can handle all text chunks. We try to reduce the number of false negatives as much as possible, at the expense of more false positives, which are handled as follows:

- After flagging all potential hallucinations (as seen before, most of them will be false positives), we retrieve the relevant context for both the question and the answer together. As seen before, giving LettuceDetect extra context from the answer can increase the number of false negatives, so we do not use LettuceDetect for this step.

- Finally, we provide an LLM with the relevant context, the detected hallucination, and the complete answer, and ask it to check whether the passage is truly a hallucination.
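
The verification step described above could look roughly like the sketch below. Everything here is an assumption made for illustration: the prompt wording, the `verify_spans` function, the span dictionaries with `text` and `confidence` keys, and the `fake_llm` stub that replaces a real LLM client. The confidence values are made up.

```python
from typing import Callable, Dict, List


def build_prompt(context: str, answer: str, flagged: str) -> str:
    """Assemble a yes/no verification prompt for one flagged passage."""
    return (
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        f"Flagged passage:\n{flagged}\n\n"
        "Is the flagged passage unsupported by the context? Reply YES or NO."
    )


def verify_spans(spans: List[Dict], context: str, answer: str,
                 ask_llm: Callable[[str], str]) -> List[Dict]:
    """Keep only the flagged spans that the LLM confirms as hallucinations."""
    confirmed = []
    for span in spans:
        reply = ask_llm(build_prompt(context, answer, span["text"]))
        if reply.strip().upper().startswith("YES"):
            confirmed.append(span)
    return confirmed


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM client: confirms only the engine span."""
    flagged = prompt.split("Flagged passage:\n")[1].split("\n")[0]
    return "YES" if "engine" in flagged else "NO"


# Illustrative spans; the confidence values are invented.
spans = [
    {"text": "The TFSI four-cylinder engine", "confidence": 0.9},
    {"text": "12.3-inch", "confidence": 0.5},
]
confirmed = verify_spans(
    spans,
    context="The Audi A4 2024 features a 12.3-inch digital instrument display.",
    answer="The TFSI four-cylinder engine features a 12.3-inch display.",
    ask_llm=fake_llm,
)
print(confirmed)
```

Passing the LLM call in as a function keeps the filtering logic independent of any particular client, so it can be exercised with a stub as shown here.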

First, we will test it with the example that tricked the model with the previous strategies.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The TFSI four-cylinder engine features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** True

**Include answer:** True

**Deep search:** False

**LLM verifier:** True

### Outputs

**Status:** Hallucinations detected

**Detected Hallucinations:** 'The TFSI four-cylinder engine' - Confidence = 0.9390618801116943

**LLM Verified Hallucinations:** 'The TFSI four-cylinder engine' - Confidence = 0.9390618801116943

## Test 7

We continue testing the LLM-based approach, this time with a correct answer, to see whether the verifier correctly discards false positives.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (correct):** The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** True

**Include answer:** True

**Deep search:** False

**LLM verifier:** True

### Outputs

**Status:** Hallucinations detected

**Detected Hallucinations:** 'TFSI four-cylinder engine' - Confidence = 0.7963008284568787

**LLM Verified Hallucinations:**

## Test 8

We continue testing the LLM verifier by providing a hallucinated answer, expecting the model to correctly confirm it as a hallucination.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The BMW 3 Series 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** True

**Include answer:** True

**Deep search:** False

**LLM verifier:** True

### Outputs

**Status:** Hallucinations detected

**Detected Hallucinations:** 'BMW 3 Series 20' - Confidence = 0.6695796847343445

**LLM Verified Hallucinations:** 'BMW 3 Series 20' - Confidence = 0.6695796847343445