# Retrieval-Augmented Generation using ChromaDB

This notebook demonstrates Retrieval-Augmented Generation (RAG) by an LLM based on a vector DB with private data.

## Contents

- [Motivation](#Motivation--What-is-RAG-and-why-do-we-need-it?)
- [Vector DB Setup](#Vector-DB-Setup)
- [Local LLM-based RAG (Llama 3.1)](#Local-LLM-based-RAG-(Llama-3.1))
- [API-based RAG (OpenAI API)](#API-based-RAG:-OpenAI-API)

## Motivation - What is RAG and why do we need it?

While the last few years have brought huge progress in Generative AI, particularly in Natural Language Processing (NLP) and transformer-based models, one key issue remains: Large Language Models (LLMs) like ChatGPT may suffer from **hallucinations**, providing misleading or false results.

One way to mitigate this problem is to explicitly prompt the LLM to validate its output against additional data from an external source. This can be done by fact-checking the generated response or by directly enforcing that the response is generated purely from the external data. Both approaches still exploit the model's general ability to understand the semantics and syntax of queries and of data provided via prompts. However, because the relevant data is supplied as part of the prompt, they remove the pitfall that the model may never have seen the data needed to answer the user prompt appropriately. This can reduce hallucinations.

The process of using additional information from an external data source to answer the user prompt is called **Retrieval-Augmented Generation (RAG)**. This way, we can ensure that the model has all the data it needs to provide the best possible answer. As **most of the world's data is private**, this step can significantly improve the output quality.

## System Configuration

If we want to extract valuable information from an external data source based on the user prompt, we need to identify suitable entries in this data source for the LLM to rely on. A key concept in this context is the use of **embeddings**. 

### Vector DB Setup

During setup of the system, we perform two major steps:

1. We process our private data so we can feed it into a vector database. For this, we need to generate a vector that represents the data. This is commonly performed by chunking the data and generating embeddings of the text data / documents.
2. We store the embeddings in the vector DB for rapid retrieval of the embeddings and the associated document.

### Retrieval-Augmented Generation

Once we have the embeddings stored in the DB, we can set up a system for processing user queries. This requires three steps:

1. When the user sends a prompt, we generate an embedding of the prompt.
2. We determine the similarity of the prompt embedding to the embeddings stored in the vector DB (e.g., by cosine similarity) and retrieve the $n$ most similar documents, provided that their similarity surpasses a certain threshold.
3. We then submit a query to the LLM in which we ask the LLM to explicitly answer the prompt based on the provided information from the vector DB.
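The retrieval step (step 2) can be sketched without any vector DB. The following is a minimal illustration in plain Python, assuming toy 3-dimensional vectors that are made up for this example; the helper names `cosine_similarity` and `top_n_similar` are hypothetical and not part of ChromaDB or LangChain.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query_embedding: Sequence[float],
                  document_embeddings: Sequence[Sequence[float]],
                  n: int = 2,
                  threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs of the n most similar documents above the threshold."""
    scored = [(index, cosine_similarity(query_embedding, embedding))
              for index, embedding in enumerate(document_embeddings)]
    # keep only documents whose similarity surpasses the threshold
    kept = [(index, score) for index, score in scored if score > threshold]
    # most similar documents first
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:n]


# toy "embeddings" (made up for illustration; real embeddings have hundreds of dimensions)
toy_document_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_embedding = [1.0, 0.1, 0.0]

print(top_n_similar(toy_query_embedding, toy_document_embeddings, n=2, threshold=0.5))
```

In this toy setup, the third document is orthogonal to the query and is therefore filtered out by the threshold. A vector DB performs the same kind of ranking, typically with an index that avoids comparing the query against every stored vector.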

Note that depending on the quality of the embeddings, the query to the DB might result in the retrieval of similar, but unrelated documents. If not properly fine-tuned, the LLM might therefore sometimes interpret the information incorrectly.

## Practical Implementation

Here, we demonstrate this by providing additional information via a local ChromaDB; other vector DBs like Pinecone or FAISS are not covered. We will use both a local LLM (Llama 3.1) and the OpenAI API for this purpose.

### Vector DB Setup

Let's set up the vector DB. Here, we will feed the ChromaDB with exemplary data from the Wikipedia API for demonstration, but you can feed it with any information you like, such as PDF files, data from APIs, database systems, etc.

**CAUTION: make sure not to overload the Wikipedia API!** Read the official [API documentation](https://www.mediawiki.org/wiki/Special:MyLanguage/API:Main_page) for further information prior to use.
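One way to stay on the safe side is to send requests one after another with a pause in between and to identify the client via a `User-Agent` header. The following is a minimal sketch under these assumptions; `MinIntervalThrottle`, `REQUEST_HEADERS`, the one-second interval, and the header value are hypothetical choices for this example, not requirements taken from the API documentation.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Ensure that at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self,
                 min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the number of seconds slept."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay


# placeholder value: replace it with a description of your own project and contact
REQUEST_HEADERS = {"User-Agent": "rag-demo-notebook/0.1 (contact: you@example.org)"}
throttle = MinIntervalThrottle(min_interval=1.0)
```

To use it, call `throttle.wait()` directly before each `requests.get(...)` call and pass `headers=REQUEST_HEADERS` to the request.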

In [34]:
import bs4
import requests

from tqdm import tqdm
from typing import Optional, List

In [35]:
WIKIPEDIA_API_URL = 'https://en.wikipedia.org/w/api.php'

### Method definitions

Let's define some methods to populate our text corpus by retrieving articles from Wikipedia.

In [36]:
def get_wikipedia_page_ids(search_term: str) -> List[int]:
    """
    Retrieves Wikipedia Page IDs for articles related to a search term.
    
    :param search_term: term to search for.
    
    :return: list of Page IDs.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search_term,
        "format": "json"
    }
    response = requests.get(WIKIPEDIA_API_URL,
                            params=params)
    data = response.json()
    return [elem['pageid'] for elem in data['query']['search']]


def get_wikipedia_article_by_id(page_id: int) -> str:
    """
    Retrieve the full text of a Wikipedia article based on its page ID.

    :param page_id: page ID of the corresponding article.

    :return: full text of the article.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "pageids": page_id,
        "explaintext": True,
        "exlimit": "1"  # one extract per request; the full text is returned because `exintro` is not set
    }
    response = requests.get(WIKIPEDIA_API_URL,
                            params=params)
    data = response.json()
    return data['query']['pages'][str(page_id)]['extract'].strip()


def get_wikipedia_articles_by_search_term(search_term: str,
                                          max_page_ids: int = 3) -> List[str]:
    """
    Retrieve n Wikipedia articles directly from a search term via the Page ID.

    :param search_term: term to search for.
    :param max_page_ids: maximal number of page IDs to retrieve.

    :return: article contents in a list.
    """
    page_ids = get_wikipedia_page_ids(search_term=search_term)

    documents = []
    for page_id in page_ids[:max_page_ids]:
        documents.append(get_wikipedia_article_by_id(page_id=page_id))

    return documents


def get_wikipedia_articles_for_multiple_search_terms(search_terms: List[str],
                                                     max_page_ids: int = 3) -> List[str]:
    """
    Retrieve n Wikipedia articles for multiple search terms.

    Note: we are not keeping track of the article titles here!

    :param search_terms: list of search terms to query for.
    :param max_page_ids: number of page IDs to extract for each search term.

    :return: list of all extracted documents.
    """
    documents = []

    for search_term in search_terms:
        documents.extend(
            get_wikipedia_articles_by_search_term(search_term=search_term,
                                                  max_page_ids=max_page_ids)
        )

    return documents

### Method testing

Briefly check whether the methods yield reasonable output.

In [37]:
documents = get_wikipedia_articles_by_search_term(search_term='simulated reality',
                                                  max_page_ids=3)

In [38]:
documents[0][:100]

'A simulated reality is an approximation of reality created in a simulation, usually in a set of circ'

### Corpus Generation

Let us now populate the corpus with multiple documents. We will retrieve data from 3 pages (top 3 hits) for each search term.

In [39]:
search_terms = ['antikythera mechanism', 'simulated reality', 'large language models', 'model interpretability']
max_page_ids = 3

In [40]:
documents = get_wikipedia_articles_for_multiple_search_terms(
    search_terms=search_terms,
    max_page_ids=max_page_ids
)

Let's check if the expected number of documents was retrieved:

In [41]:
assert len(search_terms)*max_page_ids == len(documents), "mismatch of expected and retrieved documents"

### Setup of Vector DB

Now that we have retrieved some documents, let's store them in a Vector DB.

#### Splitting of Input Data

We will recursively split the text into chunks.

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter  

In [43]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100
)

In [44]:
all_chunks = []
for document in documents:
    all_chunks.extend(text_splitter.split_text(document))  

In [45]:
len(all_chunks)

349

Note that for a production system, the chunk size and structure need to be tuned: chunks that are too big are coarse and dilute the relevant passage with unrelated text (noise), while chunks that are too small may lack the context needed to answer a query.
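To get a feeling for how chunk size and overlap interact, the following sketch splits a string with a plain sliding window. This is a deliberately simplified stand-in: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it does not try to split at separators such as paragraph breaks.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
for size, overlap in [(10, 2), (5, 1)]:
    chunks = split_with_overlap(sample_text, chunk_size=size, chunk_overlap=overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunks -> {chunks}")
```

Smaller chunks yield more, shorter entries in the vector DB; the overlap repeats the end of one chunk at the start of the next so that sentences cut at a boundary remain retrievable.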

#### Embeddings and Vector DB Setup

We will generate `HuggingFaceEmbeddings` using `all-MiniLM-L6-v2` as model for the vector DB setup.

In [46]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Vector DB setup with local persistence:

In [None]:
vector_db = Chroma.from_texts(
    texts=all_chunks,  
    embedding=embeddings,  
    persist_directory="chroma_db"
)

Later on, the vector DB can be loaded using:

```python
vector_db = Chroma(
    persist_directory="chroma_db",
    embedding_function=embeddings
)
```

#### Document retrieval methods

To use the DB in a pipeline, we can set it up as a retriever. This way, we don't need to generate embeddings for our query manually; the retriever handles this internally.

The retriever can also be configured, e.g., with the number of documents to retrieve per query. For simplicity, we will use the default parameters here.

In [49]:
retriever = vector_db.as_retriever()
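If the defaults are not suitable, `as_retriever` accepts a `search_type` and `search_kwargs`. The following sketch wraps this in a hypothetical helper `build_retriever`; the values chosen for `k` and `score_threshold` are illustrative assumptions and would need tuning for a real corpus.

```python
from typing import Any, Optional


def build_retriever(vector_store: Any,
                    k: int = 4,
                    score_threshold: Optional[float] = None):
    """
    Create a retriever that returns at most k chunks per query.

    If score_threshold is given, only chunks whose relevance score reaches
    the threshold are returned.
    """
    if score_threshold is None:
        return vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": k},
        )
    return vector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": k, "score_threshold": score_threshold},
    )
```

With the DB from above, this could be used as `retriever = build_retriever(vector_db, k=3)`.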

The retrieval function will retrieve the actual documents from the DB based on the embedding of the query:

In [50]:
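def get_documents_from_db(query: str) -> str:
    """
    Retrieve documents from the Vector DB based on a query.

    :param query: query.

    :return: page contents of the retrieved documents, concatenated into a single string.
    """
    # the retriever embeds the query and returns the most similar chunks
    docs = retriever.invoke(query)

    # concatenate the page contents of all retrieved chunks
    return "".join(doc.page_content for doc in docs)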
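If the same chunks end up in the collection more than once, the retriever can return identical passages several times. A likely cause (this is an assumption, not something verified here) is running the DB setup repeatedly against the same `persist_directory`. The following sketch shows a hypothetical helper, `join_unique_chunks`, that drops exact duplicates and separates the remaining chunks more clearly.

```python
from typing import Iterable, List


def join_unique_chunks(chunks: Iterable[str], separator: str = "\n\n") -> str:
    """Join chunk texts in their original order, skipping empty chunks and exact duplicates."""
    seen = set()
    unique: List[str] = []
    for chunk in chunks:
        text = chunk.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return separator.join(unique)
```

Inside the retrieval function, this could replace the plain concatenation, e.g. `join_unique_chunks(doc.page_content for doc in retriever.invoke(query))`.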

Small test run:

In [51]:
get_documents_from_db('What is the antikythera mechanism?')

"The Antikythera mechanism ( AN-tik-ih-THEER-ə, US also  AN-ty-kih-) is an Ancient Greek hand-powered orrery (model of the Solar System). It is the oldest known example of an analogue computer. It could be used to predict astronomical positions and eclipses decades in advance. It could also be used to track the four-year cycle of athletic games similar to an Olympiad, the cycle of the ancient Olympic Games.The Antikythera mechanism ( AN-tik-ih-THEER-ə, US also  AN-ty-kih-) is an Ancient Greek hand-powered orrery (model of the Solar System). It is the oldest known example of an analogue computer. It could be used to predict astronomical positions and eclipses decades in advance. It could also be used to track the four-year cycle of athletic games similar to an Olympiad, the cycle of the ancient Olympic Games.=== Origin ===\nThe Antikythera mechanism is generally referred to as the first known analogue computer. The quality and complexity of the mechanism's manufacture suggests it must h

## Local LLM-based RAG (Llama 3.1)

Now that the Vector DB is fully set up, let us use it for Retrieval-Augmented Generation!

### Prompting Methods

We need to set up methods for the System Prompt (pre-prompt of the system telling it how to interact with the provided information) and the User Prompt.

In [83]:
def get_system_prompt(retrieved_documents: str):
    """
    Method for retrieving the System Prompt based on retrieved documents.
    
    :param retrieved_documents: retrieved documents from the Vector DB.

    :return: System Prompt for the LLM.
    """
    system_prompt = f"""
    INSTRUCTIONS:
    
    Please respond to the users' questions only using the provided DOCUMENT. Limit your RESPONSE to the facts recorded in the DOCUMENT.
    
    If the document does not contain the facts necessary to answer the QUESTION, respond with "This query cannot be answered given the provided information." and stop generation afterwards.
    
    You may rephrase the information you have retrieved from the DOCUMENT, e.g., by combining information from multiple text blocks.
    
    Ensure, that you avoid redundancies and duplications in your output.
    
    DOCUMENT:
    
    {retrieved_documents}

    END OF DOCUMENTS.
	"""
    
    return system_prompt


def get_user_prompt(query: str):  
    """
    Method for retrieving a User Prompt template based on the users' query.
    
    :param query: Query provided by the user.

    :return: User Prompt for the LLM.
    """
    user_prompt = f"""
    INPUT:
    
    {query}

    Please provide your RESPONSE below.
    
    RESPONSE:
    """
    
    return user_prompt

### Model Loading

Let's load the model and associated tokenizers!

In [70]:
from transformers import LlamaForCausalLM, AutoConfig, AutoTokenizer
import torch

In [71]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

In [72]:
LLAMA_PATH = 'D:/Code/notebooks/dev-notebooks/llama/llama-models/models/llama3_1/Meta-Llama-3.1-8B/gguf-quantised/Q8'

In [None]:
# create a BitsAndBytesConfig object for quantization - we will use 4-bit quantization here
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # use `load_in_8bit=True` for 8-bit precision
    bnb_4bit_quant_type="nf4"  # NF4 data type; an unknown `quant_type` keyword would be ignored
)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    LLAMA_PATH, 
    quantization_config=quantization_config
)

### Pipeline Setup

In [75]:
from transformers import pipeline  

In [76]:
TOKENIZER_PATH = 'D:/Code/notebooks/dev-notebooks/llama/llama-models/models/llama3_1/Meta-Llama-3.1-8B/hf'

In [77]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

In [78]:
# RAG Query/Pipeline Setup
pipe = pipeline(
    "text-generation",  
    model=model,
    tokenizer=tokenizer
)

#### Query Definition for RAG

Lastly, we need a method to handle the whole RAG process.

In [79]:
from time import time

In [80]:
template = "\nPrompt:\n{prompt}\n\n{answer}\n\nTotal time: {total_time}\n\nRetrieved Context:\n{context}"

In [81]:
def rag_query_local_llm(
    query: str,
    temperature: float = 0.1,
    max_length: int = 1024,
    show_context: bool = False
):
    """
    Method to process a RAG Query.

    :param query: User Query.
    :param temperature: temperature for predictions.
    :param max_length: maximal output length (number of newly generated tokens).
    :param show_context: show the context of the response.
    """
    start_time = time()

    # retriever
    context = get_documents_from_db(query)

    # augmented generation
    system_message = get_system_prompt(context)
    user_message = get_user_prompt(query)

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]

    # render the chat messages into a single prompt string
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # stop generation at the end-of-sequence or end-of-turn token
    terminators = [
        pipe.tokenizer.eos_token_id,
        pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    sequences = pipe(
        prompt,
        do_sample=True,
        top_p=0.9,
        temperature=temperature,
        eos_token_id=terminators,
        max_new_tokens=max_length,
        return_full_text=False
    )

    answer = sequences[0]['generated_text']
    end_time = time()
    total_time = f"{round(end_time-start_time, 2)} sec."

    return template.format(prompt=user_message,
                           answer=answer,
                           total_time=total_time,
                           context=context if show_context else None)

### Test Queries

Let's prompt our agent with some queries.

#### Topics on which information is provided

In [32]:
response = rag_query_local_llm(
	"What is the Antikythera mechanism?",  
	temperature=0.1,  
	max_length=128,
    show_context=True
)
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.



Prompt:

    INPUT:
    
    What is the Antikythera mechanism?
    
    RESPONSE:
    

OUTPUT:
    
    The Antikythera mechanism is an Ancient Greek hand-powered orrery (model of the Solar System). It is the oldest known example of an analogue computer. It could be used to predict astronomical positions and eclipses decades in advance. It could also be used to track the four-year cycle of athletic games similar to an Olympiad, the cycle of the ancient Olympic Games.=== Origin ===
The Antikythera mechanism is generally referred to as the first known analogue computer. The quality and complexity of the mechanism's manufacture suggests it must have had undiscovered predecessors during the Hellenistic period. Its construction relied

Total time:
20.61 sec.
\Retrieved Context:
The Antikythera mechanism ( AN-tik-ih-THEER-ə, US also  AN-ty-kih-) is an Ancient Greek hand-powered orrery (model of the Solar System). It is the oldest known example of an analogue computer. It could be used to 

In [33]:
response = rag_query_local_llm(
	"Name some characteristics of a simulated reality.",  
	temperature=0.1,  
	max_length=128,
    show_context=True
)
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.



Prompt:

    INPUT:
    
    Name some characteristics of a simulated reality.
    
    RESPONSE:
    

OUTPUT:
    
    A simulated reality is an approximation of reality created in a simulation, usually in a set of circumstances in which something is engineered to appear real when it is not.
Most concepts invoking a simulated reality relate to some form of computer simulation, whether through the creation of a virtual reality that creates appearance of being in a real world, or a theoretical process like mind uploading, in which a mind could be uploaded into a computer simulation. A digital twin is a simulation of a real thing, created for purposes such as testing engineering outcomes.The simulation hypothesis proposes that what we experience as the world is actually a simulated reality, such as a computer

Total time:
21.17 sec.
\Retrieved Context:
One concept of a simulated reality, the simulation hypothesis, proposes that what we experience as our reality is actually a simulation

In [84]:
response = rag_query_local_llm(
	"What are the most important methods of model interpretability?",  
	temperature=0.1,  
	max_length=256,
    show_context=True
)
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.



Prompt:

    INPUT:
    
    What are the most important methods of model interpretability?

    Please provide your RESPONSE below.
    
    RESPONSE:
    

OUTPUT:
    
    === Interpretability ===
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of reverse-engineering artificial neural networks to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for AI safety and alignment, as it may enable to identify signs of undesired behaviors such as sycophancy, deceptiveness or bias, and to better steer AI models.
Studying the interpretability of the most advanced foundation models often involves searching for an automated way to identify "features" in generative pretrained transformers. In a neural network, a feature is a pattern of neuron activations that co

#### Unavailable Information

In [82]:
response = rag_query_local_llm(
	"Tell me something about rabbits.",  
	temperature=0.1,  
	max_length=256,
    show_context=True
)
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.



Prompt:

    INPUT:
    
    Tell me something about rabbits.
    
    RESPONSE:
    

OUTPUT:
    
    Rabbits are small mammals in the family Leporidae of the order Lagomorpha (along with the hare and the pika). Oryctolagus cuniculus includes the European rabbit species and its descendants, the world's 305 breeds of domestic rabbit. Sylvilagus includes 13 wild rabbit species, among them the seven types of cottontail. The European rabbit, which has been introduced on every continent except Antarctica, is familiar throughout the world as a wild prey animal and as a domesticated form of livestock and pet. With its widespread effect on ecologies and cultures, the rabbit is, in many areas of the world, a part of daily life—as food, clothing, a companion, and as a source of artistic inspiration.
    Rabbits are small mammals in the family Leporidae of the order Lagomorpha (along with the hare and the pika). Oryctolagus cuniculus includes the European rabbit species and its descendants, th

### Evaluation of the Results

For the cases where information was successfully retrieved from the vector DB, the agent responded with the provided data in 2 out of 3 cases. For the third case, it provided an answer outside of the scope of the provided documents.

Limiting the response to the provided documents also doesn't seem to work in cases where questions outside of the scope of the vector DB are asked. Again, the agent responds with learned data instead.


It would be interesting to investigate whether a non-quantized model adheres more closely to the system prompt. Moreover, prompt engineering might resolve the observed issues.
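Another option that does not depend on the model following instructions is to check the relevance of the retrieved chunks before generation and to refuse directly if nothing relevant was found. The following is a minimal sketch, assuming the vector store offers LangChain's `similarity_search_with_relevance_scores`; the helper names, the `generate` callback, and the threshold of `0.3` are hypothetical and would need tuning, since the score scale depends on the distance metric of the collection.

```python
from typing import Any, Callable, List

NO_ANSWER = "This query cannot be answered given the provided information."


def retrieve_relevant_chunks(vector_store: Any,
                             query: str,
                             k: int = 4,
                             min_score: float = 0.3) -> List[str]:
    """Return the texts of the retrieved chunks whose relevance score reaches min_score."""
    scored_docs = vector_store.similarity_search_with_relevance_scores(query, k=k)
    return [doc.page_content for doc, score in scored_docs if score >= min_score]


def guarded_rag_query(vector_store: Any,
                      query: str,
                      generate: Callable[[str, str], str],
                      k: int = 4,
                      min_score: float = 0.3) -> str:
    """Only call the LLM (via `generate(query, context)`) if relevant context was found."""
    chunks = retrieve_relevant_chunks(vector_store, query, k=k, min_score=min_score)
    if not chunks:
        # nothing relevant in the vector DB: refuse without calling the LLM
        return NO_ANSWER
    return generate(query, "\n\n".join(chunks))
```

Here, `generate` stands for any function that takes the query and the context and returns the model's answer, for example a thin wrapper around the prompting and pipeline code of this section.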

## API-based RAG: OpenAI API

For RAG using the OpenAI API, we will still rely on the same Vector DB and use the embeddings as shown previously to fetch documents related to the query. However, we will then feed the retrieved documents to the OpenAI API instead of a local LLM.

Note that you will need an OpenAI API key for this, stored in a `.env` file. Specify the path to this file below.

In [53]:
import openai

from dotenv import load_dotenv, find_dotenv
from time import time
from typing import List

### OpenAI API Key

In [29]:
env_path = find_dotenv(filename='path/to/.env', 
                       raise_error_if_not_found=True)

In [30]:
load_dotenv(env_path)

True
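`load_dotenv` returning `True` only indicates that the file was loaded, not that it contains the expected key. A small check can fail early with a clearer message. The following is a sketch with a hypothetical helper `require_env`; it assumes the key is stored under the name `OPENAI_API_KEY`, which is the variable the OpenAI client reads by default.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is missing or empty; check your .env file."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(env_path)` surfaces a missing key before the first API request is made.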

### Method Definition

In [57]:
openai_template = "\nPrompt:\n{prompt}\n\n{answer}\n\nTotal time: {total_time}"  

In [31]:
client = openai.OpenAI()

In [55]:
def get_system_prompt_openai() -> str:
    """
    Get the system prompt for the RAG query.

    :return: system prompt instructing the model to answer purely based on the provided CONTEXT.
    """
    return """
    You are a knowledgeable assistant that provides detailed answers based on a provided CONTEXT and a USER QUERY.
    
    You will base your answer purely on this provided CONTEXT; if the CONTEXT doesn't contain the required information to answer the USER QUERY, clearly state this and don't provide an answer.
    
    You may rephrase and condense the CONTEXT to answer the question.
    """


def get_user_prompt_openai(context: str,
                           query: str) -> str:
    """
    Generate a user prompt based on the retrieved documents and the user query.
    """
    return f"""
    USER QUERY:

    {query}
    
    Here is the CONTEXT to answer the USER QUERY:

    {context}
    """


def get_chat_response(messages: List[dict],
                      max_tokens: int = 150) -> str:
    """
    Generate a chat response based on the provided message history.

    :param messages: message history (list of dictionaries).
    :param max_tokens: maximal number of tokens for the answer.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content.strip()


def rag_query_openai(
    query: str,
    max_tokens: int = 150,
    show_context: bool = True
):
    """
    Method to process a RAG Query.

    :param query: raw user query.
    :param max_tokens: maximal number of tokens for the answer.
    :param show_context: show the retrieved context.
    """
    start_time = time()
	
	# retriever
    context = get_documents_from_db(query)
	
	# augmented generation
    system_message = get_system_prompt_openai()  
    user_message = get_user_prompt_openai(context=context,
                                          query=query)

    # define prompt history
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message} 
	]

    # generate response
    answer = get_chat_response(
        messages=messages,
        max_tokens=max_tokens
    )

    total_time = f"{round(time() - start_time, 2)} sec."
    
    return openai_template.format(prompt=user_message,
                                  answer=answer, 
                                  total_time=total_time,
                                  context=context if show_context else None)

### Test Queries

Let's run some test queries.

#### Questions within Vector DB Scope

In [56]:
response = rag_query_openai(
	query="What are the most important methods of model interpretability?",
    show_context=True
)
print(response)


Prompt:

    USER QUERY:

    What are the most important methods of model interpretability?
    
    Here is the CONTEXT to answer the USER QUERY:

    === Interpretability ===
Scholars sometimes use the term "mechanistic interpretability" to refer to the process of reverse-engineering artificial neural networks to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.
Interpretability research often focuses on generative pretrained transformers. It is particularly relevant for AI safety and alignment, as it may enable to identify signs of undesired behaviors such as sycophancy, deceptiveness or bias, and to better steer AI models.
Studying the interpretability of the most advanced foundation models often involves searching for an automated way to identify "features" in generative pretrained transformers. In a neural network, a feature is a pattern of neuron activations that corresponds to a concept

In [58]:
response = rag_query_openai(
	query="Name some characteristics of a simulated reality.",
    show_context=True
)
print(response)


Prompt:

    USER QUERY:

    Name some characteristics of a simulated reality.
    
    Here is the CONTEXT to answer the USER QUERY:

    One concept of a simulated reality, the simulation hypothesis, proposes that what we experience as our reality is actually a simulation within a system being operated externally to our reality.One concept of a simulated reality, the simulation hypothesis, proposes that what we experience as our reality is actually a simulation within a system being operated externally to our reality.A simulated reality is an approximation of reality created in a simulation, usually in a set of circumstances in which something is engineered to appear real when it is not.
Most concepts invoking a simulated reality relate to some form of computer simulation, whether through the creation of a virtual reality that creates appearance of being in a real world, or a theoretical process like mind uploading, in which a mind could be uploaded into a computer simulation. A di

#### Question outside of Vector DB Scope

In [61]:
response = rag_query_openai(
	query="Tell me something about rabbits.",  
    show_context=True
)
print(response)


Prompt:

    USER QUERY:

    Tell me something about rabbits.
    
    Here is the CONTEXT to answer the USER QUERY:

    real-world surroundings and may injure themselves by tripping over, or colliding with real-world objects.real-world surroundings and may injure themselves by tripping over, or colliding with real-world objects.The earliest known inhabitants (5th or 4th millennium BC) were likely seasonal hunters who traveled there to exploit the presence of migratory birds. The population of the island then changed frequently as it was settled and abandoned several times, including a period of significant influence by Cretan culture during the Bronze Age. In antiquity, the island of Antikythera was known as Aegilia or Aigilia (Αἰγιλία), Aegila or Aigila (Αἴγιλα), or Ogylos (Ὤγυλος).
Between the 4th and 1st centuries BC, it was used as a base by a group of Cilician pirates until their destruction by Pompey the Great. Their fort can still be seen atop a cliff to the northeast of the

### Evaluation of the Results

The agent provides answers as expected. It extracts text blocks from the provided documents, if these are related to the query. If no suitable documents were provided, it points this out and doesn't answer at all.

## Concluding Remarks

The performance of the quantized version of Llama 3.1 and GPT-4o mini (used in the OpenAI API-based approach) differed significantly. While no output outside of the scope of the provided documents was observed for GPT-4o mini, this was not the case for Llama 3.1. However, when dealing with private data, privacy concerns may have to be weighed against performance. Tackling privacy concerns may involve anonymizing input queries and provided documents, among other strategies.
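As a rough illustration of the anonymization idea, the following sketch masks e-mail addresses and phone-number-like digit sequences before a text is sent to an external API. The patterns and the `redact` helper are hypothetical and intentionally simple; pattern-based masking of this kind will miss many identifiers (names, addresses, IDs) and should not be mistaken for proper anonymization.

```python
import re

# simplistic patterns, chosen for illustration only
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s()/-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-number-like sequences with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.org or +49 170 1234567 for details."))
```

The same function could be applied to both the user query and the retrieved context before they are inserted into the prompt.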