# 1. Description of the project

This project implements a retrieval-augmented generation (RAG) system and combines it with LettuceDetect to detect hallucinations in the answers.

# 2. Setup

1. **Install these packages:**

In [None]:
%pip install -qq langchain langchain-text-splitters langchain-unstructured langchain-chroma langchain-openai unstructured langchain-community "unstructured[pdf]" python-dotenv lettucedetect gradio ipykernel

2. **Import the necessary modules**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_unstructured import UnstructuredLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain_chroma.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_core.documents.base import Document
import os, io, sys
from dotenv import load_dotenv, find_dotenv
from lettucedetect.models.inference import HallucinationDetector
import gradio as gr

3. **Deploy an Azure OpenAI LLM resource and embedding resource**

    Use the following link: https://ai.azure.com/
4. **Save the details to the .env file:**
    ```bash
    echo AZURE_OPENAI_API_KEY=\"your-api-key-here\" >> .env
    echo AZURE_OPENAI_API_VERSION=\"your-version-here\" >> .env
    echo AZURE_OPENAI_ENDPOINT=\"your-endpoint-here\" >> .env
    echo GPT_MODEL=\"your-llm-model-here\" >> .env
    echo EMBEDDINGS_MODEL_NAME=\"your-embeddings-model-here\" >> .env
    echo EMBEDDINGS_DEPLOYMENT=\"your-embeddings-deployment-here\" >> .env
    ```

# 3. ChromaDB setup

## 3.1 The text splitter

The text splitter divides documents into manageable chunks to optimize downstream processing and retrieval in RAG workflows.

In [None]:

def text_splitter(data, debug = False):
    """Split the documents into chunks."""
    if debug:
        print("--- In function text_splitter ---\n"
              f"Splitting {len(data)} documents into chunks...\n"
        )

    # Split the documents into chunks of at most 1000 characters
    # with an overlap of up to 50 characters. The local name differs
    # from the function name so the function is not shadowed.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50,
        length_function=len,
    )
    chunks = splitter.split_documents(data)
    return chunks
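
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph breaks, line breaks, and spaces. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts (chunk_size - chunk_overlap) characters later
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

The last character of each chunk is repeated as the first character of the next one, which is the role the 50-character overlap plays in the splitter above.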

## 3.2 The document loader

The document loader reads and parses files from the corpus directory into structured document objects for downstream processing.

In [None]:
def load_documents(corpus_dir: str = "", debug = False):
    """Load documents from the specified corpus directory."""
    if debug:
        print("--- In function load_documents ---")
    loaded_docs = []

    # Load all documents from the specified corpus directory
    if corpus_dir:
        if debug:
            print(f"Loading documents from corpus directory: {corpus_dir}\n")
        for file in os.listdir(corpus_dir):
            if debug:
                print(f"Loading {file}...")
            # os.path.join also works when corpus_dir has no trailing slash
            loader = UnstructuredLoader(os.path.join(corpus_dir, file),
                                        mode = 'single')
            loaded_docs.extend(loader.load())

    # Filter complex metadata from loaded documents
    if debug:
        print("Filtering complex metadata...\n")
    filtered_docs = filter_complex_metadata(loaded_docs)

    return filtered_docs

## 3.3 The embedding client

The embedding client initializes and manages Azure OpenAI embeddings for converting text into vector representations.

In [None]:
def embeddings(debug = False):
    """ Initialize and return an Azure OpenAI embeddings client. """
    load_dotenv(find_dotenv())
    model = os.getenv('EMBEDDINGS_MODEL_NAME')
    api_key = os.getenv('AZURE_OPENAI_API_KEY')
    api_version = os.getenv("AZURE_OPENAI_API_VERSION")
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    azure_deployment = os.getenv("EMBEDDINGS_DEPLOYMENT")

    # Validate required environment variables
    if not all([model, api_key, api_version, azure_endpoint, azure_deployment]):
        raise ValueError(
            "Missing environment variables.\n"
            "Please add all the required environment variables "
            "to the .env file:\n"
            "EMBEDDINGS_MODEL_NAME, AZURE_OPENAI_API_KEY, "
            "AZURE_OPENAI_API_VERSION, AZURE_OPENAI_ENDPOINT, "
            "EMBEDDINGS_DEPLOYMENT"
        )
    
    # Initialize and return an Azure OpenAI embeddings client
    if debug:
        print("--- In function embeddings ---\n"
              f"Initializing embeddings with model: {model}, "
              f"deployment: {azure_deployment}\n"
        )
    embeddings = AzureOpenAIEmbeddings(
        model = model,
        api_key = api_key,
        api_version = api_version,
        azure_endpoint = azure_endpoint,
        azure_deployment = azure_deployment,
    )
    return embeddings

## 3.4 The vector database

The vector database stores document embeddings for fast similarity search and retrieval. Built with Chroma, it enables efficient access to relevant document chunks in RAG workflows.

In [None]:
def create_database(document_list, debug = False):
    """
    Initialize a database from a given corpus of documents.
    """    
    # Create a Chroma vector store from the documents
    if debug:
        print("--- In function create_database ---\n"
              "Creating vector database "
              f"with {len(document_list)} documents...\n"
        )
        
    # Return the vector store
    return Chroma.from_documents(documents = document_list,
                                 embedding = embeddings(debug = debug),
                                 persist_directory = None,)

## 3.5 Test retriever

The retriever fetches relevant document chunks from the vector database using embeddings to match user queries with semantically similar content for efficient retrieval in RAG workflows. It is not based on LLMs, but purely on a similarity search algorithm.
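
The general idea of embedding-based similarity search can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors are made up, the helper names are hypothetical, and the distance metric and index that Chroma actually uses depend on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up toy embeddings; real embeddings have many more dimensions
toy_store = [
    ("off-road SUV with high ground clearance", [0.9, 0.1, 0.0]),
    ("compact city car with low fuel consumption", [0.1, 0.9, 0.0]),
    ("pickup truck built for rough terrain", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_store, k=2))
```

No language model is involved at query time: the ranking depends only on the geometry of the stored vectors and the query vector.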

In [None]:
# Load documents from the specified corpus directory
test_docs = load_documents(corpus_dir = "./example_inputs/", debug = True)
test_docs

In [None]:
# Test if the metadata attribute 'source' exists in the first document
test_docs[0].metadata['source']

In [None]:
# Split the loaded documents into chunks
test_chunks = text_splitter(test_docs, debug = True)
print(f"Number of chunks created: {len(test_chunks)}")
test_chunks

In [None]:
# Create a vector database from the chunks
test_vectordb = create_database(test_chunks, debug = True)

In [None]:
test_retriever = test_vectordb.as_retriever()
test_retriever

In [None]:
test_relevant_docs = test_retriever.invoke("""
    Which car model provides exceptional ground clearance and capability?
""")
test_relevant_docs

In [None]:
for idx, doc in enumerate(test_relevant_docs):
    print(f"{idx}: {doc.metadata['source']}")

In [None]:
# Collect the unique source paths of the retrieved documents
test_sources = {doc.metadata.get('source') for doc in test_relevant_docs if 'source' in doc.metadata}
test_sources

# 4. The Hallucination detector

In [None]:
def detect_hallucinations(context, question, answer, debug = False):
    """Detect hallucinated spans in the answer with LettuceDetect."""
    if debug:
        print("--- In function detect_hallucinations ---\n"
              f"Predicting hallucination for question: {question}\n"
              f"with answer: {answer}\n"
        )

    # Initialize the hallucination detector with a transformer model
    detector = HallucinationDetector(
        method="transformer",
        model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
    )

    # Predict hallucination using the detector
    result = detector.predict(context = context,
                              question = question,
                              answer = answer,
                              output_format = "spans")
    return result
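
The backend below reads the `text` and `confidence` fields of each returned span. As a sketch of how such spans could be shown in context, the following helper marks them inside the answer. It assumes each span also carries `start` and `end` character offsets into the answer and that spans do not overlap; the helper name `highlight_spans` and the example data are made up, not real detector output.

```python
def highlight_spans(answer: str, spans: list[dict]) -> str:
    """Wrap each flagged span of the answer in [[ ... ]] markers."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        # Copy the unflagged text before the span, then the marked span
        pieces.append(answer[cursor:span["start"]])
        pieces.append("[[" + answer[span["start"]:span["end"]] + "]]")
        cursor = span["end"]
    # Copy whatever follows the last span
    pieces.append(answer[cursor:])
    return "".join(pieces)


# Hypothetical example data in the assumed span format
example_answer = "The capital of France is Paris. Its population is 69 million."
span_start = example_answer.index("Its population")
example_spans = [{
    "start": span_start,
    "end": len(example_answer),
    "text": example_answer[span_start:],
    "confidence": 0.99,
}]
print(highlight_spans(example_answer, example_spans))
```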

# 5. The interface

In this section a GUI is created for the tests.

## 5.1 The Corpus class

In [None]:
class Corpus:
    """
    A class to handle corpus creation and file uploads.
    """

    def __init__(self, corpus_dir: str = "./corpus/", keep_files = False,
                 debug = False):
        """
        Initialize the Corpus with a directory path.

        If the directory exists, it will be emptied by default.
        """
        if debug:
            print("--- In Corpus constructor ---\n"
                  f"Creating corpus at {corpus_dir}...\n"
            )
        
        # Save the directory for later
        self.directory = corpus_dir

        # Unless specified, empty the specified corpus directory
        if not keep_files and os.path.exists(self.directory):
            for f in os.listdir(self.directory):
                os.remove(os.path.join(self.directory, f))

        # Make sure the directory exists in both cases
        os.makedirs(self.directory, exist_ok=True)

    def upload_files(self, uploaded_files, debug = False):
        """
        Save the uploaded files to the corpus directory
        """
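        if debug:
            print("--- In function Corpus.upload_files ---\n"
                  f"Uploading {len(uploaded_files)} files "
                  "to corpus directory...\n"
            )

        # Save each uploaded file to the corpus directory (if any)
        for file in uploaded_files:
            # Accept plain path strings as well as objects with a .name path
            src_path = getattr(file, "name", file)
            filename = os.path.basename(src_path)
            dst_path = os.path.join(self.directory, filename)
            # Context managers close both files even if the copy fails
            with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
                dst.write(src.read())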

## 5.2 The Retriever class

In [None]:
class Retriever:
    """
    A retriever class to handle document retrieval from a given corpus directory.
    """
    def __init__(self, corpus: Corpus, debug = False):
        """
        Initialize the retriever with a corpus directory
        """

        if debug:
            print("--- In Retriever constructor ---\n"
                  f"Creating retriever from {corpus.directory}...\n"
            )

        # Save the corpus
        self.corpus = corpus

        # Keep these as instance attributes so they are not
        # garbage-collected while the retriever is in use
        self.docs = load_documents(self.corpus.directory, debug = debug)
        self.chunks = text_splitter(self.docs, debug = debug)
        self.vectordb = create_database(self.chunks, debug = debug)
        self.retriever = self.vectordb.as_retriever()

    def get_full_docs(self, doc_list: list[Document],
                      debug = False) -> list[Document]:
        """ Load full documents from the specified document list. """

        if debug:
            print("--- In function get_full_docs ---\n"
                f"Loading full documents from {len(doc_list)} filepaths...\n"
            )

        # Extract the sources from the document metadata
        sources = {doc.metadata.get('source')
            for doc in doc_list
            if 'source' in doc.metadata
        }

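        # For each path, add the file content to the list of full documents.
        # Files are read as UTF-8 text, so binary formats such as PDF
        # are not supported by this method.
        full_docs = []
        for path in sources:
            with open(path, 'r', encoding='utf-8') as file:
                full_document = Document(page_content = file.read(),
                                         metadata = {'source': path})
                full_docs.append(full_document)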
            if debug:
                print(f"Loaded document: {path}")
        return full_docs
    
    def deep_search(self, answer: str, retrieved_docs: list[Document],
                    debug = False) -> list[Document]:
        """
        Retrieve additional context documents relevant to the answer
        from the documents retrieved for the question.
        """

        if debug:
            print("--- In function Retriever.deep_search ---\n"
                  f"Retrieving documents relevant to the answer: {answer}\n"
                  "From the documents retrieved for the question...\n"
            )

        # Obtain the sources from the retrieved documents
        sources = {doc.metadata.get('source')
                   for doc in retrieved_docs
                   if 'source' in doc.metadata
        }

        # Obtain the relevant documents for the answer
        retrieved_docs_answer = self.retriever.invoke(answer)

        # Filter the retrieved documents to only include those
        # that are in the sources of the retrieved documents
        retrieved_docs_answer = [
            doc for doc in retrieved_docs_answer
            if doc.metadata.get('source') in sources
        ]

        # Join the retrieved documents with the original ones
        # (a new list is built so the caller's list is not modified)
        retrieved_docs = retrieved_docs + retrieved_docs_answer
        if debug:
            # Build the list first: a line break inside an f-string
            # expression is a syntax error before Python 3.12
            answer_sources = [doc.metadata.get('source', 'Unknown')
                              for doc in retrieved_docs_answer]
            print("Number of documents retrieved in deep search: "
                  f"{len(retrieved_docs_answer)}\n"
                  f"Retrieved documents: {answer_sources}\n"
            )
        
        # Return the retrieved documents
        return retrieved_docs

    def get_relevant_documents(self, question: str = "", answer: str = "", 
                               full_docs=False, include_answer=False,
                               deep_search=False, debug=False) -> list[Document]:
        """
        Retrieve relevant documents from the corpus
        """

        # Retrieve documents relevant to the question
        if debug:
            print("--- In function Retriever.get_relevant_documents ---\n"
                  "Retrieving relevant documents for question:\n"
                  f"{question}\n"
            )
        retrieved_docs = self.retriever.invoke(question)
        if debug:
            # Build the list first: a line break inside an f-string
            # expression is a syntax error before Python 3.12
            question_sources = [doc.metadata.get('source', 'Unknown')
                                for doc in retrieved_docs]
            print("Retrieved relevant documents for question.\n"
                  f"Number of retrieved documents: {len(retrieved_docs)}\n"
                  f"Retrieved documents: {question_sources}\n"
            )
            
        # If deep search is enabled, retrieve documents relevant to the answer
        # in the documents retrieved for the question
        if deep_search:
            if not include_answer:
                raise ValueError(
                    "Deep search is enabled, but include_answer is False. "
                    "Please set include_answer to True to use deep search."
                )
            retrieved_docs = self.deep_search(
                answer = answer,
                retrieved_docs = retrieved_docs,
                debug = debug
            )
        
        # If deep search is disabled and include_answer is True,
        # retrieve documents relevant to the answer
        elif include_answer:
            retrieved_docs_answer = self.retriever.invoke(answer)
            retrieved_docs = retrieved_docs + retrieved_docs_answer

        # If full_docs mode is enabled, return the complete retrieved documents
        if full_docs:
            retrieved_docs = self.get_full_docs(retrieved_docs, debug = debug)

        # Print the retrieved documents if debug mode is enabled
        if debug:
            final_sources = [doc.metadata.get('source', 'Unknown')
                             for doc in retrieved_docs]
            print(f"Final number of documents: {len(retrieved_docs)}\n"
                  f"Retrieved documents: {final_sources}\n"
            )

        # Return the retrieved documents
        return retrieved_docs

## 5.3 The backend function

In [None]:
def gradio_backend(uploaded_files = None, context = "", question = "", answer = "",
                   full_docs = False, include_answer = False,
                   deep_search = False, debug = False):
    """
    Gradio backend function to handle file uploads and hallucination detection.
    """

    # If debug mode is enabled, redirect the standard output to a string
    if debug:
        debug_output_stream = io.StringIO()
        original_stdout = sys.stdout
        sys.stdout = debug_output_stream
        
    # Initialize variables for the output
    print("--- In function gradio_backend ---")
    hallucination_was_found = "No hallucinations found"
    hallucination_str = ""
    error_output = gr.update(
        value = "",
        visible = False
    )
    debug_output = gr.update(
        value = "",
        visible = False
    )

    try:
        # Initialize the corpus and save the uploaded files and context
        corpus = Corpus(corpus_dir = "./temp/", debug = debug)
        if uploaded_files:
            corpus.upload_files(uploaded_files, debug = debug)
        if context:
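            # Write with the same encoding that get_full_docs reads with
            context_path = os.path.join(corpus.directory, "context.txt")
            with open(context_path, "w", encoding = "utf-8") as f:
                f.write(context)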
        if not uploaded_files and not context:
            raise ValueError(
                "No files or context provided. "
                "Please upload files or provide context."
            )
            

        # Initialize the retriever with the corpus and retrieve relevant documents
        retriever = Retriever(corpus = corpus, debug = debug)
        retrieved_docs = retriever.get_relevant_documents(
            question = question,
            answer = answer,
            full_docs = full_docs,
            include_answer = include_answer,
            deep_search = deep_search,
            debug = debug
        )

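        # Predict hallucinations using the detect_hallucinations function.
        # The detector works on text, so only the page contents are passed.
        detected_hallucination = detect_hallucinations(
            context = [doc.page_content for doc in retrieved_docs],
            question = question,
            answer = answer,
            debug = debug
        )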

        # Check if a hallucination was detected
        if detected_hallucination:
            hallucination_was_found = "Hallucinations detected"

        # Create an output string based on the result
        for hallucination in detected_hallucination:
            hallucination_str += (
                f"\'{hallucination['text']}\'"
                f" - Confidence = {hallucination['confidence']}\n"
            )

    except Exception as exc:
        error_output = gr.update(
            value = str(exc),
            visible = True
        )

    finally:
        if debug:
            # Restore the original standard output
            sys.stdout = original_stdout

            # Store the debug output
            debug_output = gr.update(
                value = debug_output_stream.getvalue(),
                visible = True
            )

    return hallucination_was_found, hallucination_str, error_output, debug_output
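
The backend above swaps `sys.stdout` by hand and restores it in the `finally` block. The standard library offers `contextlib.redirect_stdout` for the same purpose, which restores the original stream automatically when the block exits. A minimal sketch, with the hypothetical helpers `run_with_captured_output` and `noisy_add` used only for illustration:

```python
import io
from contextlib import redirect_stdout


def run_with_captured_output(func, *args, **kwargs):
    """Call func and return (result, everything it printed)."""
    buffer = io.StringIO()
    # stdout is restored on exit, even if func raises an exception
    with redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_add(a, b):
    """Stand-in for a function that prints debug messages."""
    print(f"Adding {a} and {b}")
    return a + b


captured_result, captured_log = run_with_captured_output(noisy_add, 2, 3)
```

Here `captured_result` holds the return value and `captured_log` holds the text that would otherwise have gone to the console.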


## 5.4 The interface

In [None]:
rag_application = gr.Interface(
    fn = gradio_backend,
    allow_flagging = "never",
    inputs = [
        # Drag and drop files, returns a list of file paths
        gr.File(
            label = "Upload PDF/txt files",
            file_count = 'multiple',
            file_types = ['.pdf', '.txt']
        ),
        gr.Textbox(
            label = "Context",
            placeholder = "Provide additional context here..."
        ),
        gr.Textbox(
            label = "Prompt",
            placeholder = "Type your question here..."
        ),
        gr.Textbox(
            label = "Answer",
            lines = 3,
            placeholder = "Type the answer here..."
        ),
        gr.Checkbox(
            label = "Full documents mode",
            value = False
        ),
        gr.Checkbox(
            label = "Include answer in relevance search",
            value = False
        ),
        gr.Checkbox(
            label = "Deep search",
            value = False
        ),
        gr.Checkbox(
            label = "Debug mode",
            value = False
        ),
    ],
    outputs = [
        gr.Textbox(label = "Status"),
        gr.Textbox(label = "Detected Hallucinations"),
        gr.Textbox(label = "Error Output", visible = False),
        gr.Textbox(label = "Debug Output", visible = False)
    ],
    title = "RAG system with Hallucination Detection",
    description = ("Upload a collection of PDF or txt files, then provide "
                   "a prompt and a response. The backend will try to detect "
                   "hallucinations in the response based on the context.")
)

rag_application.launch()

# 6. Tests

After completing the setup, several tests were carried out to check the performance of the system. Each one reflects a state of the project and justifies the adjustments made to the code to improve that performance.

## Test 1

This test was done before the answer was included in the input for retrieving relevant chunks of information. Thus, only the question was used to find them.

### Inputs

**Question:** Which car model features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (correct):** The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges.

**Full docs:** False

**Include answer:** False

**Deep search:** False

### Outputs

**Status:** Hallucinations detected

**Detected hallucinations:** 'The Audi A4 2024' - Confidence = 0.9719486832618713

We retrieve the relevant documents for the query in order to understand this result:

In [None]:
# keep_files = True prevents the constructor from deleting the example inputs
test_corpus = Corpus(corpus_dir = "./example_inputs/", keep_files = True,
                     debug = True)
test_retriever = Retriever(test_corpus, debug = True)
docs = test_retriever.get_relevant_documents(question = """
    Which car model features a 12.3-inch high-resolution digital instrument
    display that replaces traditional analog gauges?
    """)
docs

The problem is that the 'Audi A4 2024' part of the answer does not appear in the retrieved context, which was selected for its similarity to the question, even though both are part of the same document. There are two possible options:

- Providing additional context to the detector so that the missing parts are not flagged as hallucinations (explored in Tests 2 to 5).

- Post-processing a passage once it has been flagged as a hallucination to find out whether it truly is one (explored in Test 6).

## Test 2

The same test as before but now including the answer in the similarity search.

### Inputs

**Question:** Which car model features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (correct):** The Audi A4 2024 features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges.

**Full docs:** False

**Include answer:** True

**Deep search:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

We retrieve the relevant documents for the question and the provided answer together. In this case, the information about the "Audi A4 2024" model is included in the retrieved chunks.

In [None]:
# keep_files = True prevents the constructor from deleting the example inputs
test_corpus = Corpus(corpus_dir = "./example_inputs/", keep_files = True,
                     debug = True)
test_retriever = Retriever(test_corpus, debug = True)
# The question and the answer are joined into a single query string
docs = test_retriever.get_relevant_documents(
    "Which car model features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges? "
    "The Audi 2024 features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges."
)
docs

In the following code block, we join the retrievals for the question and answer separately in order to provide a greater context to the system.

**Note:** In the output, 8 out of 10 documents are retrieved due to the similarity of their contents. Keep in mind that all the example inputs are AI-generated car descriptions. In a scenario where the content is more diverse, the relative amount of retrieved documents compared to the whole corpus is expected to be smaller.

In [None]:
# keep_files = True prevents the constructor from deleting the example inputs
test_corpus = Corpus(corpus_dir = "./example_inputs/", keep_files = True,
                     debug = True)
test_retriever = Retriever(test_corpus, debug = True)
question = """
    Which car model features a 12.3-inch high-resolution digital 
    instrument display that replaces traditional analog gauges?
"""
answer = """
    The Audi 2024 features a 12.3-inch high-resolution digital
    instrument display that replaces traditional analog gauges.
"""
test_retriever.get_relevant_documents(question, answer, include_answer = True,
                                      debug = True)

### Conclusion

We chose to retrieve documents for the question and the answer separately and then combine them, so that both the context of the query and the specific entities or facts in the answer are represented. This avoids missing relevant information and reduces false positives in hallucination detection, because LettuceDetect receives a fuller context for verification.

A more comprehensive context makes it more likely that the supporting evidence for true statements is included, so fewer correct statements are flagged as hallucinations.
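
Because the two result lists are simply concatenated, a chunk that matches both the question and the answer can appear twice in the context. A possible refinement, sketched here under assumptions, is to drop duplicates while keeping the original order. The `Doc` class is a hypothetical stand-in for the LangChain `Document`, and `merge_retrievals` is not part of the code above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the two fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_retrievals(question_docs: list[Doc], answer_docs: list[Doc]) -> list[Doc]:
    """Combine both result lists, keeping the first copy of each chunk."""
    seen = set()
    merged = []
    for doc in question_docs + answer_docs:
        # A chunk is identified by its source file and its text
        key = (doc.metadata.get("source"), doc.page_content)
        if key not in seen:
            seen.add(key)
            merged.append(doc)
    return merged


# Made-up chunks for illustration
chunk_a = Doc("12.3-inch digital display", {"source": "audi.txt"})
chunk_b = Doc("The Audi A4 2024 is a sedan", {"source": "audi.txt"})
merged_docs = merge_retrievals([chunk_a], [chunk_a, chunk_b])
print([doc.page_content for doc in merged_docs])
```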


## Test 3

In this case, we successfully trick the detector with a wrong answer that is built from information actually present in the retrieved documents.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The Virtual Cockpit Technology features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** False

**Include answer:** True

**Deep search:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

## Test 4

In this test, the functionality for loading the entire documents containing the relevant context is added.

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The Virtual Cockpit Technology features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** True

**Include answer:** True

**Deep search:** False

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

### Conclusion 

This method seemed promising but failed, because the BERT-based detector is not prepared to handle inputs as long as full documents.
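
If input length is the limiting factor, one possible mitigation is to cap how much context is passed to the detector. The following is a rough sketch, assuming the chunks arrive ranked by relevance. The helper `cap_context` and its character budget are hypothetical, and a character count is only a crude proxy for the token limit of the model.

```python
def cap_context(chunks: list[str], max_chars: int) -> list[str]:
    """Keep the leading chunks whose total length fits within max_chars."""
    kept = []
    used = 0
    for chunk in chunks:
        # Stop at the first chunk that would exceed the budget
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept


# Made-up chunks: only the first two fit in a budget of 7 characters
print(cap_context(["aaaa", "bbb", "cc"], max_chars=7))
```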

## Test 5

In this test, we try the **deep search** strategy. First, we retrieve the context relevant to the question.
Next, we enrich it with context relevant to the answer, taken only from the documents already retrieved for the question.
This way, the information comes only from documents relevant to the question, but the important information about the answer is not lost.
At the same time, context from documents unrelated to the question is kept out.
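
The filtering step of this strategy can be sketched with plain dictionaries. This is a simplified illustration of the logic in `Retriever.deep_search`, not a replacement for it: the helper name `deep_search_filter` and the example hits are made up.

```python
def deep_search_filter(question_hits: list[dict], answer_hits: list[dict]) -> list[dict]:
    """Add answer hits only if their source was already found for the question."""
    allowed_sources = {hit["source"] for hit in question_hits}
    extra_hits = [hit for hit in answer_hits if hit["source"] in allowed_sources]
    return question_hits + extra_hits


# Made-up retrieval results
hits_for_question = [{"source": "audi.txt", "text": "12.3-inch digital display"}]
hits_for_answer = [
    {"source": "audi.txt", "text": "The Audi A4 2024 is a sedan"},
    {"source": "bmw.txt", "text": "The BMW has a digital display"},
]
deep_hits = deep_search_filter(hits_for_question, hits_for_answer)
print([hit["source"] for hit in deep_hits])
```

The hit from `bmw.txt` is dropped because no chunk of that file was retrieved for the question.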

First, we check the retrieved context for the same question as before, this time using the deep search strategy:

In [None]:
# keep_files = True prevents the constructor from deleting the example inputs
test_corpus = Corpus(corpus_dir = "./example_inputs/", keep_files = True,
                     debug = True)
test_retriever = Retriever(test_corpus, debug = True)

# Trailing spaces keep the adjacent string literals from running together
question = (
    "Which car model features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges?"
)
answer = (
    "The Audi 2024 features a 12.3-inch high-resolution digital "
    "instrument display that replaces traditional analog gauges."
)
          
test_retriever.get_relevant_documents(
    question = question, answer = answer, full_docs = False,
    include_answer = True, deep_search = True, debug = True
)

### Inputs

**Question:** Which car features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges?

**Answer (incorrect):** The Virtual Cockpit Technology features a 12.3-inch high-resolution digital instrument display that replaces traditional analog gauges

**Full docs:** False

**Include answer:** True

**Deep search:** True

### Outputs

**Status:** No hallucinations detected

**Detected hallucinations:**

### Conclusion

With the deep search strategy the problem persists: even though all the context comes from files relevant to the question, the detector cannot tell when a piece of information is inadequate for the answer if it appears somewhere in the context.