# Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex

## **Introduction**

Retrieval-augmented generation (RAG) combines the retrieval capabilities of a search system with the generative capabilities of a large language model (LLM). When implementing a RAG system, one critical parameter that governs the system’s efficiency and performance is the `chunk_size`. How do you find the optimal chunk size for retrieval? This is where the LlamaIndex `Response Evaluation` module comes in handy. In this blog post, we'll guide you through the steps to determine the best `chunk_size` using that module. If you're unfamiliar with the `Response Evaluation` module, we recommend reviewing its [documentation](https://docs.llamaindex.ai/en/latest/core_modules/supporting_modules/evaluation/modules.html) before proceeding.

## **Why Chunk Size Matters**

Choosing the right `chunk_size` is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

1. **Relevance and Granularity**: A small `chunk_size`, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the `similarity_top_k` setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.
2. **Response Generation Time**: As the `chunk_size` increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

In essence, determining the optimal `chunk_size` is about striking a balance: capturing all essential information without sacrificing speed. Test thoroughly with various sizes to find a configuration that suits your specific use case and dataset.
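
To make the granularity trade-off concrete, here is a minimal sketch. It is not LlamaIndex's splitter: it cuts a list of stand-in tokens into fixed-size windows and reports how many chunks result, together with an upper bound on the retrieved tokens that reach the LLM for a given `similarity_top_k`. The helper names `split_fixed` and `context_budget` are hypothetical and exist only for this illustration.

```python
def split_fixed(tokens, chunk_size, overlap=0):
    """Split tokens into windows of chunk_size; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def context_budget(chunk_size, similarity_top_k):
    """Upper bound on retrieved tokens passed to the LLM per query."""
    return chunk_size * similarity_top_k


tokens = [f"tok{i}" for i in range(2048)]  # stand-in for a tokenized document
for size in (128, 512):
    chunks = split_fixed(tokens, size)
    print(size, len(chunks), context_budget(size, similarity_top_k=2))
```

Under these assumptions, the smaller size produces four times as many chunks to search through, while the larger size sends four times as much retrieved text to the LLM per query.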

## **Setup**

Before starting the experiment, install and import the required modules:

In [10]:
!pip install llama-index llama-index-embeddings-openai spacy llama-index-embeddings-huggingface



In [11]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator
)
from llama_index.llms.openai import OpenAI

import openai
import time
openai.api_key = ''

## **Load Data**

Let’s load our documents.

In [12]:
# Load Data
document_base_path = "../data/web-software-development/"

def get_filepath_substring(file_path):
    if document_base_path in file_path:
        return file_path.split(document_base_path)[1]
    return file_path

# documents_path = f"{document_base_path}21-working-with-databases/" # single chapter
documents_path = f"{document_base_path}" # full course

reader = SimpleDirectoryReader(documents_path, recursive=True)

documents = reader.load_data()
print(len(documents))

319


## **Question Generation**

To select the right `chunk_size`, we'll compute metrics like Average Response time, Faithfulness, and Relevancy for various `chunk_sizes`. The `DatasetGenerator` will help us generate questions from the documents.

In [13]:
# Evaluate on all documents. (Uncomment the next line to restrict evaluation to the first 20 documents.)
# eval_documents = documents[:20]
eval_documents = documents
print("Amount of documents: ", len(eval_documents))

# data_generator = DatasetGenerator.from_documents(documents)
# eval_questions = data_generator.generate_questions_from_nodes(num = 40)

# generated from above, hardcoded to save costs
# TODO: these questions are not representative of student queries
# (too generic, they do not test comprehension); collect real queries from the chatbot instead


gen_eval_questions = ['what is the importance of using a database in web applications?',
                      'What database management system will be used in this course?',
                      'What are the learning objectives related to working with databases?',
                      'Where can you find a tutorial for SQL basics if you need a refresher?',
                      'How can you start using PostgreSQL according to the document?',
                      'What is the recommended approach for taking PostgreSQL into use for development?',
                      #   'What is the purpose of the Walking skeleton in relation to PostgreSQL?',
                      'What are some options for running PostgreSQL locally?',
                      'Name two hosted services that provide PostgreSQL as a service.',
                      #   'Why does the document strongly recommend using the first option for development when starting to use PostgreSQL?',
                      'What are two options for starting to use PostgreSQL as mentioned in the document?',
                      'What are some examples of hosted services that provide PostgreSQL databases?',
                      #   'Why does the document strongly recommend using the first option for development?',
                      'How can you get started with ElephantSQL according to the document?',
                      "What attributes are included in the table created in the document's example using SQL?",
                      'How can you add names to the table in ElephantSQL according to the document?',
                      "What SQL query can you use to select all rows from the 'names' table in ElephantSQL?",
                      "What library is used in the document's example to access the database programatically?",
                      #   'What information is grayed out in the image of the ElephantSQL details page?',
                      "What is the purpose of the 'id' attribute in the table created in the document's example?",
                      #   'What library is used in the first example to access a PostgreSQL database in the provided code snippet?',
                      #   'How can you specify the database credentials when using Postgres.js in the provided code snippet?',
                      #   'What is the purpose of the `max: 2` parameter in the Postgres.js example?',
                      #   'In the second example, what library is used to access a PostgreSQL database?',
                      'What is the recommended alternative to Deno Postgres mentioned in the document?',
                      #   'How can you establish a connection to a PostgreSQL database using Deno Postgres in the provided code snippet?',
                      #   'What query is executed in the Deno Postgres example to retrieve data from the database?',
                      'What is the significance of having a database client when working with databases?',
                      'What is the default database client mentioned in the document for accessing a PostgreSQL database?',
                      'Where can you find a list of PostgreSQL clients for different operating systems according to the document?',
                      'What database driver is used when working with Deno and PostgreSQL in the provided document?',
                      'How can you create a database client using the Postgres.js driver?',
                      #   'In the example code provided, what SQL query is being executed to retrieve data from the database?',
                      'How does Postgres.js ensure safe query generation when constructing SQL queries?',
                      'What is the purpose of the `sql` function in the Postgres.js driver?',
                      #   'In the example code, how is the result data iterated over to print only the name property?',
                      #   'What SQL statement is used to insert data into a database in the provided document?',
                      #   'After inserting a new name into the database, how many names are present in the database according to the output?',
                      'What flag is required to be used with Deno when working with the Postgres.js driver?',
                      'How can you access the Postgres.js documentation for further details on tagged template literals?']


eval_qa = [
    ('Why use NoSQL instead of SQL?',
     'They are designed to scale horizontally.'),

    ('What is the difference between authentication and authorization?',
     'The term authentication refers to identifying a user. The term authorization refers to the process of verifying that the user has the rights to perform the actions that the user is trying to perform.'),

    ('What is a UUID?',
     'UUIDs are a common way of identifying resources. They are 128-bit numbers that are designed for being unique without central coordination (i.e. a service that would keep track of which identifier to assign next).'),
]


def get_page_number(path):
    substr = get_filepath_substring(path)
    chapter, section = substr.split("/")
    chapter_no = chapter.split("-")[0]
    section_no = section.split("-")[0]
    return chapter_no + section_no

print("=== EVAL QUESTIONS AND ANSWERS ===")
print(eval_qa)


Amount of documents:  319
=== EVAL QUESTIONS AND ANSWERS ===
[('Why use NoSQL instead of SQL?', 'They are designed to scale horizontally.'), ('What is the difference between authentication and authorization?', 'The term authentication refers to identifying a user. The term authorization refers to the process of verifying that the user has the rights to perform the actions that the user is trying to perform.'), ('What is a UUID?', 'UUIDs are a common way of identifying resources. They are 128-bit numbers that are designed for being unique without central coordination (i.e. a service that would keep track of which identifier to assign next).')]
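
As a quick illustration of the two path helpers defined in this notebook, the sketch below runs them on a made-up path. The chapter directory `21-working-with-databases` appears in the loading cell; the file name `1-introduction.md` is hypothetical.

```python
document_base_path = "../data/web-software-development/"


def get_filepath_substring(file_path):
    # Strip the base directory so paths are relative to the course root
    if document_base_path in file_path:
        return file_path.split(document_base_path)[1]
    return file_path


def get_page_number(path):
    # Expects exactly "<chapter>/<section>" after the base path
    substr = get_filepath_substring(path)
    chapter, section = substr.split("/")
    chapter_no = chapter.split("-")[0]
    section_no = section.split("-")[0]
    return chapter_no + section_no


# Hypothetical file name, used only for illustration
example_path = document_base_path + "21-working-with-databases/1-introduction.md"
print(get_filepath_substring(example_path))
print(get_page_number(example_path))
```

Note that `get_page_number` unpacks exactly two path segments, so a file nested more deeply than `<chapter>/<section>` would raise a `ValueError`.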


## Setting Up Evaluators

We use GPT-3.5-Turbo (`llm_evaluate`) as the backbone for evaluating the responses generated during the experiment. Three evaluators, `FaithfulnessEvaluator`, `RelevancyEvaluator`, and `CorrectnessEvaluator`, are initialised with the `service_context`.

1. **Faithfulness Evaluator** - It measures whether the response from a query engine is supported by the retrieved source nodes, i.e. whether the response was hallucinated.
2. **Relevancy Evaluator** - It measures whether the response and source nodes match the query, i.e. whether the query was actually answered.
3. **Correctness Evaluator** - It scores the response against a reference answer for the query.

In [14]:
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# LLM used to generate the responses


gpt35 = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# gpt4 = OpenAI(model="gpt-4", temperature=0.2)

roberta = HuggingFaceEmbedding(model_name="deepset/roberta-base-squad2")


def get_openai_text_embedding_3_large(chunk_size):
    return OpenAIEmbedding(model="text-embedding-3-large", chunk_size=chunk_size,)

def get_openai_text_embedding_3_small(chunk_size):
    return OpenAIEmbedding(model="text-embedding-3-small", chunk_size=chunk_size,)

def get_openai_text_embedding_ada_002(chunk_size):
    return OpenAIEmbedding(model="text-embedding-ada-002", chunk_size=chunk_size,)

def get_roberta_text_embedding():
    return roberta

Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
llm_evaluate = OpenAI(temperature=0, model="gpt-3.5-turbo")

# Define service context for llm evaluation
service_context_gpt_35 = ServiceContext.from_defaults(llm=llm_evaluate)

# Define Faithfulness and Relevancy Evaluators
faithfulness_gpt = FaithfulnessEvaluator(
    service_context=service_context_gpt_35)
relevancy_gpt = RelevancyEvaluator(service_context=service_context_gpt_35)
correctness_gpt = CorrectnessEvaluator(service_context=service_context_gpt_35)



# (DO NOT RUN) Debugging local embeddings

In [16]:
# from llama_index.core import Settings

# from llama_index.core.schema import IndexNode
# from llama_index.core import (
#     load_index_from_storage,
#     StorageContext,
#     VectorStoreIndex,
# )
# from llama_index.core.node_parser import SentenceSplitter
# from llama_index.core import SummaryIndex
# from llama_index.core.retrievers import RecursiveRetriever
# import os
# # from tqdm.notebook import tqdm
# import pickle


# def build_index_local(docs, chunk_size, out_path: str):
#     print("Chunk size: ", chunk_size)

#     Settings.embed_model = embed_model

#     nodes = []

#     splitter = SentenceSplitter(
#         chunk_size=chunk_size, chunk_overlap=chunk_size/4)
#     for idx, doc in enumerate(docs):
#         print('Splitting: ' + str(idx))

#         cur_nodes = splitter.get_nodes_from_documents([doc])
#         for cur_node in cur_nodes:
#             # ID will be base + parent
#             file_path = doc.metadata["file_path"].split(document_base_path)[1]
#             new_node = IndexNode(
#                 text=cur_node.text or "None",
#                 index_id=str(file_path),
#                 metadata=doc.metadata,
#                 # obj=doc
#             )
#             nodes.append(new_node)
        

#         # Debugging
#         print(len(cur_nodes), len(str(doc)), len(str(cur_nodes[0])))
#         for xyz in cur_nodes:
#             print(xyz)
#             print("-")
#         print()
#         print("----DOC-----")
#         print(doc)

#         print()
#         print()

#     print("num nodes: " + str(len(nodes)))

#     service_context = ServiceContext.from_defaults(
#         llm=llm_evaluate, embed_model=embed_model)

#     # save index to disk
#     if not os.path.exists(out_path):
#         index = VectorStoreIndex(nodes, service_context=service_context)
#         index.set_index_id("simple_index")
#         index.storage_context.persist(f"./{out_path}")
#     else:
#         # rebuild storage context
#         storage_context = StorageContext.from_defaults(
#             persist_dir=f"./{out_path}"
#         )
#         # load index
#         index = load_index_from_storage(
#             storage_context, index_id="simple_index", service_context=service_context
#             # storage_context, index_id="simple_index", embed_model=embed_model
#         )

#     return index


# # build_index_local(eval_documents, 1024, "Test")

# vs. Storing embeddings (Weaviate)
`docker-compose -f docker-compose.weaviate-persistent.yml up`

In [17]:
# Build the Weaviate index name, e.g. text-embedding-3-large with chunk size 128 -> W128te3l
# The "W" prefix is needed because the name has to start with a capital letter
def get_full_course_index(chunk_size, model_name):
    return f"W{chunk_size}{''.join([x[0] for x in model_name.split('-')])}"
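
The naming helper can be exercised on its own. The sketch below repeats the function so it is self-contained and applies it to the three embedding model names used in this notebook; the resulting names are the schema names that appear in the index-building log.

```python
def get_full_course_index(chunk_size, model_name):
    # "W" + chunk size + first character of each hyphen-separated part of the model name
    return f"W{chunk_size}{''.join([x[0] for x in model_name.split('-')])}"


for name in (
    "text-embedding-3-large",
    "text-embedding-ada-002",
    "deepset/roberta-base-squad2",
):
    print(name, "->", get_full_course_index(128, name))
```

Because only the first character of each part is kept, two different models whose name parts share the same initials would map to the same index name and silently reuse each other's cached index.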

In [18]:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

from llama_index.core.schema import IndexNode
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import RecursiveRetriever
import os
# from tqdm.notebook import tqdm
import pickle

import weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from IPython.display import Markdown, display

client = weaviate.Client("http://localhost:8080")


def build_index(docs, chunk_size, embed_model):
    print("Chunk size: ", chunk_size)

    Settings.embed_model = embed_model

    service_context = ServiceContext.from_defaults(
        llm=llm_evaluate, embed_model=embed_model)

    index_name = get_full_course_index(chunk_size, embed_model.model_name)

    # build the index and store it in Weaviate if the schema does not exist yet
    if not client.schema.exists(index_name):
        print(f"Schema {index_name} does not exist, rebuilding and then storing in db")
        nodes = []
        splitter = SentenceSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_size // 4)  # overlap must be an integer
        for idx, doc in enumerate(docs):
            print('Splitting: ' + str(idx))

            cur_nodes = splitter.get_nodes_from_documents([doc])
            for cur_node in cur_nodes:
                # ID will be base + parent
                file_path = get_filepath_substring(doc.metadata["file_path"])
                new_node = IndexNode(
                    text=cur_node.text or "None",
                    index_id=str(file_path),
                    metadata=doc.metadata,
                    # obj=doc
                )
                nodes.append(new_node)

        print("num nodes: " + str(len(nodes)))

        vector_store = WeaviateVectorStore(
            weaviate_client=client, index_name=index_name
        )
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex(nodes, storage_context = storage_context, service_context=service_context)
    else:
        # load index
        print(f"Schema {index_name} exists already, load cached from database")
        vector_store = WeaviateVectorStore(
            weaviate_client=client, index_name=index_name
        )
        index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)

    return index




In [33]:
# <chunk_size>, <k>, <use_hybrid>, <model>, <use_rag>, <embed_model>

# config_lst = [(128, 10, True, gpt35, True), (128, 8, True, gpt35, True), (256, 6, True, gpt35, True),
#  (512, 4, True, gpt35, True), (1024, 2, True, gpt35, True), (2048, 1, True, gpt35, True)]

openai_3_128_large = get_openai_text_embedding_3_large(128)
# openai_3_128_small = get_openai_text_embedding_3_small(128)
openai_ada_002_128 = get_openai_text_embedding_ada_002(128)
roberta = get_roberta_text_embedding()

config_lst = [(128, 8, True, gpt35, True, openai_3_128_large), 
              (128, 8, True, gpt35, True, openai_ada_002_128), 
              (128, 8, True, gpt35, True, roberta), 
              (128, 8, False, gpt35, True, openai_3_128_large), 
              (128, 8, False, gpt35, True, openai_ada_002_128), 
              (128, 8, False, gpt35, True, roberta), 
              (128, 8, True, gpt35, False, openai_3_128_large),
              ]

n_runs = len(config_lst)

answers = [[] for _ in range(n_runs)]
sources = [[] for _ in range(n_runs)]
source_docs = [[] for _ in range(n_runs)]

time_scores = [[] for _ in range(n_runs)]
faithfulness_scores = [[] for _ in range(n_runs)]
relevancy_scores = [[] for _ in range(n_runs)]
correctness_scores = [[] for _ in range(n_runs)]

faithfulness_false = [[] for _ in range(n_runs)]
relevancy_false = [[] for _ in range(n_runs)]
correctness_false = [[] for _ in range(n_runs)]


In [34]:
vector_indices = []
query_engines = []
chat_engines = []

# Helper methods

def build_query_engine(vector_index, similarity_top_k, is_hybrid, embed_model):
    print("== BUILDING QUERY ENGINE ==")
    if is_hybrid:
        return vector_index.as_query_engine(
            similarity_top_k=similarity_top_k, embed_model=embed_model,
            vector_store_query_mode="hybrid", alpha=0.0  # BM25
        )
    # -- VEC ONLY --
    return vector_index.as_query_engine(
        similarity_top_k=similarity_top_k, embed_model=embed_model,
    )


def build_chat_engine(vector_index, similarity_top_k, is_hybrid, embed_model):
    print("== BUILDING CHAT ENGINE ==")
    if is_hybrid:
        return vector_index.as_chat_engine(
            similarity_top_k=similarity_top_k, embed_model=embed_model,
            vector_store_query_mode="hybrid", alpha=0.0  # BM25
        )
    # -- VEC ONLY --
    return vector_index.as_chat_engine(
        similarity_top_k=similarity_top_k, embed_model=embed_model,
    )

def build_all_indices():
    for run_i, (chunk_size, similarity_top_k, is_hybrid, llm, is_rag, embed_model) in enumerate(config_lst):
        vector_index = build_index(eval_documents, chunk_size, embed_model) 
        vector_indices.append(vector_index)
        query_engines.append(build_query_engine(vector_index, similarity_top_k, is_hybrid, embed_model))
        chat_engines.append(build_chat_engine(vector_index, similarity_top_k, is_hybrid, embed_model))

build_all_indices()

Chunk size:  128
Schema W128te3l exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128tea0 exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128dbs exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128te3l exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128tea0 exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128dbs exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
Chunk size:  128
Schema W128te3l exists already, load cached from database
== BUILDING QUERY ENGINE ==
== BUILDING CHAT ENGINE ==
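
Both engine builders pass `alpha=0.0` in hybrid mode. In Weaviate's hybrid search, `alpha` weights the vector score against the keyword (BM25) score: 1 is pure vector search and 0 is pure keyword search. The sketch below illustrates that weighting as a simple linear blend over already-normalized scores. It is an illustrative simplification, not Weaviate's actual fusion algorithm, and `blend_scores` and the sample scores are hypothetical.

```python
def blend_scores(vector_score, keyword_score, alpha):
    """Linear blend: alpha=1.0 -> vector only, alpha=0.0 -> keyword (BM25) only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


# Hypothetical normalized scores for two chunks: (vector, keyword)
candidates = {"chunk_a": (0.9, 0.2), "chunk_b": (0.4, 0.8)}

for alpha in (0.0, 0.5, 1.0):
    ranked = sorted(
        candidates,
        key=lambda name: blend_scores(*candidates[name], alpha),
        reverse=True,
    )
    print(alpha, ranked)
```

If the setting behaves as the `# BM25` comment in the builders suggests, the runs labelled `hyb` rank chunks purely by keyword match, so comparing `hyb` and `vec` rows is effectively a keyword versus vector comparison rather than a test of a mixed ranking.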


  service_context = ServiceContext.from_defaults(


## **Response Evaluation For A Chunk Size**

We evaluate each configuration on four metrics:

1. Average Response Time.
2. Average Faithfulness.
3. Average Relevancy.
4. Average Correctness.

The function `evaluate_config` below does this in three steps:

1. Look up the chat and query engines built earlier for the run.
2. Answer each evaluation question, with or without RAG.
3. Compute and record the metrics.
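
As a minimal sketch of the averaging step, the snippet below aggregates made-up per-question results. Pass/fail results are booleans, which average to a pass rate, and correctness scores that could not be computed are stored as `None` and skipped. The function name `summarize` and all sample values are hypothetical.

```python
from statistics import mean


def summarize(times, faithfulness, relevancy, correctness):
    """Average per-question results; booleans count as 1 (pass) or 0 (fail)."""
    scored = [c for c in correctness if c is not None]  # skip failed evaluations
    return {
        "avg_time": mean(times),
        "avg_faithfulness": mean(faithfulness),
        "avg_relevancy": mean(relevancy),
        "avg_correctness": mean(scored) if scored else None,
    }


# Made-up results for three questions
summary = summarize(
    times=[4.0, 6.0, 5.0],
    faithfulness=[True, False, True],
    relevancy=[True, True, False],
    correctness=[4.5, None, 3.5],
)
print(summary)
```

With only a handful of questions, each average moves in large steps (thirds for three questions), so small differences between configurations should be read with caution.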

In [35]:
# Compute response time, faithfulness, relevancy and correctness for one configuration.
# GPT-3.5-Turbo generates the responses and also evaluates them (see llm_evaluate).
from llama_index.core.base.response.schema import Response


def evaluate_config(chunk_size, similarity_top_k, llm, run_i, embed_model, eval_qa=eval_qa, label="default", is_hybrid=True, is_rag=True, is_chat=True):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses for a given chunk size.
    """

    Settings.llm = llm
    Settings.embed_model = embed_model

    chat_engine = chat_engines[run_i]
    query_engine = query_engines[run_i]
    # Iterate over each (question, reference answer) pair in eval_qa to compute metrics.
    for q, a in eval_qa:
        print("--Q: ", q)
        start_time = time.time()

        if is_rag:
            if is_chat:
                response_vector = chat_engine.chat(q)
                raw_file_paths = list(set(value['file_path']
                                        for value in response_vector.sources[0].raw_output.metadata.values()))
                source_file_paths = list(
                    map(get_filepath_substring, raw_file_paths))
                source_docs[run_i].append(response_vector.source_nodes)

            else:
                response_vector = query_engine.query(q)

                # get unique list of file paths of source docs
                if response_vector.metadata is not None:
                    print("--sources: ")
                    raw_file_paths = list(set(value['file_path']
                                            for value in response_vector.metadata.values()))
                    source_file_paths = list(
                        map(get_filepath_substring, raw_file_paths))
                else:
                    source_file_paths = []
                source_docs[run_i].append(response_vector.get_formatted_sources())


            print("--A: ", str(response_vector))
            sources[run_i].append(source_file_paths)
            print("----------")


        else: # NO RAG
            # RAG answers tend to be this short, so ask for a comparable length
            q += " Make the answer one or two sentence long."
            response_vector = None
            response = str(OpenAI().complete(q))
            print("--A: ", str(response))
            response_vector = Response(response)
            sources[run_i].append([])

        answers[run_i].append(response_vector)

        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt.evaluate_response(
            response=response_vector
        ).passing if is_rag else False

        relevancy_result = relevancy_gpt.evaluate_response(
            query=q, response=response_vector
        ).passing if is_rag else False

        if not faithfulness_result:
            faithfulness_false[run_i].append((q, str(response_vector)))
        if not relevancy_result:
            relevancy_false[run_i].append((q, str(response_vector)))

        try:
            correctness_result = correctness_gpt.evaluate(
                query=q,
                response=str(response_vector),
                reference=a,
            )
            correctness_scores[run_i].append(
                (correctness_result.score, correctness_result.feedback))
            if correctness_result.score < 2.0:
                correctness_false[run_i].append(
                    (q, str(response_vector), (correctness_result.score, correctness_result.feedback)))
        except Exception:  # the evaluator call or score handling failed
            print("Exception occured")
            correctness_scores[run_i].append(None)

        time_scores[run_i].append(elapsed_time)
        faithfulness_scores[run_i].append(faithfulness_result)
        relevancy_scores[run_i].append(relevancy_result)
        # print(
        #     f"t={elapsed_time}, f={faithfulness_result}, r={relevancy_result}, c={correctness_result.score}\n-------")

    print("===========")


## **Testing Across Different Chunk Sizes**

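We'll evaluate every configuration in `config_lst` to identify which offers the most promising metrics. In this run the chunk size is fixed at 128 and `similarity_top_k` at 8, while the embedding model, the retrieval mode (hybrid or vector-only), and the use of RAG vary.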

In [36]:
from statistics import mean

# Iterate over the configurations and collect the metrics for each one.
for run_i, (chunk_size, similarity_top_k, is_hybrid, llm, is_rag, embed_model) in enumerate(config_lst):
    evaluate_config(chunk_size, similarity_top_k, llm, run_i, embed_model=embed_model, eval_qa=eval_qa, is_hybrid=is_hybrid, is_rag=is_rag)

evaluation_model_name = llm_evaluate.model

--Q:  Why use NoSQL instead of SQL?
--A:  NoSQL databases are often chosen over SQL databases when dealing with unstructured or semi-structured data. They offer more flexibility in handling varying data types and structures, making them suitable for applications that require horizontal scalability and distributed data models. NoSQL databases are also preferred for real-time applications and scenarios that require rapid development and iteration.
----------
--Q:  What is the difference between authentication and authorization?
--A:  Authentication is the process of verifying the identity of a user, ensuring that the user is who they claim to be. Authorization, on the other hand, is the process of determining what actions or resources a user is allowed to access after they have been authenticated. In simpler terms, authentication confirms identity, while authorization determines permissions and access rights.
----------
Exception occured
--Q:  What is a UUID?
--A:  A UUID, or Universally

In [37]:
print("============= STATS ============")
print(f"n_questions: {len(eval_qa)}")
print(f"evaluation model: {evaluation_model_name}")
print("============= MODELS ===========")

for config_i, (chunk_size, similarity_top_k, is_hybrid, llm, is_rag, embed_model) in enumerate(config_lst):
    time_avg = mean(time_scores[config_i])
    faithfulness_avg = mean(faithfulness_scores[config_i])
    relevancy_avg = mean(relevancy_scores[config_i])
    correctness_without_none = [
        x for x in correctness_scores[config_i] if x is not None]
    correctness_avg = sum(
        elt[0] for elt in correctness_without_none)/len(correctness_without_none) if len(correctness_without_none) > 0 else -1
    print(
        f"({'hyb' if is_hybrid else 'vec'}-{llm.model}-{embed_model.model_name}-{chunk_size}*{similarity_top_k}{'-rag' if is_rag else ''}) - avg time: {time_avg:.2f}s, avg faithf: {faithfulness_avg:.2f}, avg relev: {relevancy_avg:.2f} avg correct: {correctness_avg:.2f}")


print("============= QA ============")

for i, (q, a) in enumerate(eval_qa):
    print("========================================================")
    print(
        f"---Q evaluated by {evaluation_model_name}: {q} --- ref answer: {a}")
    for config_i, (chunk_size, similarity_top_k, is_hybrid, llm, is_rag, embed_model) in enumerate(config_lst):
        print("===")
        print(
            f"({'hyb' if is_hybrid else 'vec'}-{llm.model}-{embed_model.model_name}-{chunk_size}*{similarity_top_k}{'-rag' if is_rag else ''}): {answers[config_i][i]}")
        print(
            f"t={time_scores[config_i][i]}s f={faithfulness_scores[config_i][i]} r={relevancy_scores[config_i][i]} c={correctness_scores[config_i][i]}")
        print(
            f"sources: {sources[config_i][i]}")
    print()
    print()
    print()


n_questions: 3
evaluation model: gpt-3.5-turbo
(hyb-gpt-3.5-turbo-text-embedding-3-large-128*8-rag) - avg time: 6.56s, avg faithf: 0.00, avg relev: 0.33 avg correct: 4.25
(hyb-gpt-3.5-turbo-text-embedding-ada-002-128*8-rag) - avg time: 4.00s, avg faithf: 0.00, avg relev: 0.33 avg correct: 4.50
(hyb-gpt-3.5-turbo-deepset/roberta-base-squad2-128*8-rag) - avg time: 4.23s, avg faithf: 0.00, avg relev: 0.33 avg correct: 4.50
(vec-gpt-3.5-turbo-text-embedding-3-large-128*8-rag) - avg time: 6.06s, avg faithf: 1.00, avg relev: 0.00 avg correct: 4.67
(vec-gpt-3.5-turbo-text-embedding-ada-002-128*8-rag) - avg time: 5.25s, avg faithf: 1.00, avg relev: 0.67 avg correct: 4.50
(vec-gpt-3.5-turbo-deepset/roberta-base-squad2-128*8-rag) - avg time: 6.10s, avg faithf: 0.00, avg relev: 0.33 avg correct: 4.17
(hyb-gpt-3.5-turbo-text-embedding-3-large-128*8) - avg time: 1.17s, avg faithf: 0.00, avg relev: 0.00 avg correct: -1.00
---Q evaluated by gpt-3.5-turbo: Why use NoSQL instead of SQL? --- ref answer: