# Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex

# **Introduction**

Retrieval-augmented generation (RAG) has introduced an innovative approach that fuses the extensive retrieval capabilities of search systems with the LLM. When implementing a RAG system, one critical parameter that governs the system’s efficiency and performance is the `chunk_size`. How does one discern the optimal chunk size for seamless retrieval? This is where LlamaIndex `Response Evaluation` comes handy. In this blogpost, we'll guide you through the steps to determine the best `chunk size` using LlamaIndex’s `Response Evaluation` module. If you're unfamiliar with the `Response` Evaluation module, we recommend reviewing its [documentation](https://docs.llamaindex.ai/en/latest/core_modules/supporting_modules/evaluation/modules.html) before proceeding.

## **Why Chunk Size Matters**

Choosing the right `chunk_size` is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

1. **Relevance and Granularity**: A small `chunk_size`, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the `similarity_top_k` setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.
2. **Response Generation Time**: As the `chunk_size` increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

In essence, determining the optimal `chunk_size` is about striking a balance: capturing all essential information without sacrificing speed. It's vital to undergo thorough testing with various sizes to find a configuration that suits the specific use-case and dataset.

## **Setup**

Before embarking on the experiment, we need to ensure all requisite modules are imported:

In [11]:
!pip install llama-index llama-index-embeddings-openai spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting spacy
  Downloading spacy-3.7.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.10-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Collecting thinc<8.3.0,

In [53]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator
)
from llama_index.llms.openai import OpenAI

import openai
import time
openai.api_key = ''

## **Load Data**

Let’s load our document.

In [54]:
# Load Data
# reader = SimpleDirectoryReader("../data/web-software-development-1-0/", recursive=True)
document_base_path = "../data/web-software-development-1-0/"
documents_path = f"{document_base_path}13-working-with-databases-i/"
reader = SimpleDirectoryReader(documents_path, recursive=True)

documents = reader.load_data()
print(len(documents))

14


## **Question Generation**

To select the right `chunk_size`, we'll compute metrics like Average Response time, Faithfulness, and Relevancy for various `chunk_sizes`. The `DatasetGenerator` will help us generate questions from the documents.

In [55]:
# To evaluate for each chunk size, we will first generate a set of 40 questions from first 20 pages.
eval_documents = documents[:20]
print("Amount of documents: ", len(eval_documents))

# data_generator = DatasetGenerator.from_documents(documents)
# eval_questions = data_generator.generate_questions_from_nodes(num = 40)

# generated from above, hardcoded to save costs
all_eval_questions = ['What is the importance of using a database in web applications?',
                      'What database management system will be used in this course?',
                      'What are the learning objectives related to working with databases?',
                      'Where can you find a tutorial for SQL basics if you need a refresher?',
                      'How can you start using PostgreSQL according to the document?',
                      'What is the recommended approach for taking PostgreSQL into use for development?',
                      #   'What is the purpose of the Walking skeleton in relation to PostgreSQL?',
                      'What are some options for running PostgreSQL locally?',
                      'Name two hosted services that provide PostgreSQL as a service.',
                      #   'Why does the document strongly recommend using the first option for development when starting to use PostgreSQL?',
                      'What are two options for starting to use PostgreSQL as mentioned in the document?',
                      'What are some examples of hosted services that provide PostgreSQL databases?',
                      #   'Why does the document strongly recommend using the first option for development?',
                      'How can you get started with ElephantSQL according to the document?',
                      "What attributes are included in the table created in the document's example using SQL?",
                      'How can you add names to the table in ElephantSQL according to the document?',
                      "What SQL query can you use to select all rows from the 'names' table in ElephantSQL?",
                      "What library is used in the document's example to access the database programatically?",
                      #   'What information is grayed out in the image of the ElephantSQL details page?',
                      "What is the purpose of the 'id' attribute in the table created in the document's example?",
                      #   'What library is used in the first example to access a PostgreSQL database in the provided code snippet?',
                      #   'How can you specify the database credentials when using Postgres.js in the provided code snippet?',
                      #   'What is the purpose of the `max: 2` parameter in the Postgres.js example?',
                      #   'In the second example, what library is used to access a PostgreSQL database?',
                      'What is the recommended alternative to Deno Postgres mentioned in the document?',
                      #   'How can you establish a connection to a PostgreSQL database using Deno Postgres in the provided code snippet?',
                      #   'What query is executed in the Deno Postgres example to retrieve data from the database?',
                      'What is the significance of having a database client when working with databases?',
                      'What is the default database client mentioned in the document for accessing a PostgreSQL database?',
                      'Where can you find a list of PostgreSQL clients for different operating systems according to the document?',
                      'What database driver is used when working with Deno and PostgreSQL in the provided document?',
                      'How can you create a database client using the Postgres.js driver?',
                      #   'In the example code provided, what SQL query is being executed to retrieve data from the database?',
                      'How does Postgres.js ensure safe query generation when constructing SQL queries?',
                      'What is the purpose of the `sql` function in the Postgres.js driver?',
                      #   'In the example code, how is the result data iterated over to print only the name property?',
                      #   'What SQL statement is used to insert data into a database in the provided document?',
                      #   'After inserting a new name into the database, how many names are present in the database according to the output?',
                      'What flag is required to be used with Deno when working with the Postgres.js driver?',
                      'How can you access the Postgres.js documentation for further details on tagged template literals?']


eval_questions = [
    'What is the significance of having a database client when working with databases?',
    'What is the default database client mentioned in the document for accessing a PostgreSQL database?',
    'Where can you find a list of PostgreSQL clients for different operating systems according to the document?',
]
print("=== EVAL QUESTIONS ===")
print(eval_questions)


Amount of documents:  14
=== EVAL QUESTIONS ===
['What is the significance of having a database client when working with databases?', 'What is the default database client mentioned in the document for accessing a PostgreSQL database?', 'Where can you find a list of PostgreSQL clients for different operating systems according to the document?']


## Setting Up Evaluators

We are setting up the GPT-4 model to serve as the backbone for evaluating the responses generated during the experiment. Two evaluators, `FaithfulnessEvaluator` and `RelevancyEvaluator`, are initialised with the `service_context` .

1. **Faithfulness Evaluator** - It is useful for measuring if the response was hallucinated and measures if the response from a query engine matches any source nodes.
2. **Relevancy Evaluator** - It is useful for measuring if the query was actually answered by the response and measures if the response + source nodes match the query.

In [56]:
from llama_index.core import Settings
# We will use GPT-4 for evaluating the responses

llm_evaluate = OpenAI(temperature=0, model="gpt-3.5-turbo")
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Define service context for llm evaluation
service_context_gpt = ServiceContext.from_defaults(llm=llm_evaluate)

# Define Faithfulness and Relevancy Evaluators
faithfulness_gpt = FaithfulnessEvaluator(service_context=service_context_gpt)
relevancy_gpt = RelevancyEvaluator(service_context=service_context_gpt)

  service_context_gpt = ServiceContext.from_defaults(llm=llm_evaluate)


# (DO NOT RUN) Debugging local embeddings

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

from llama_index.core.schema import IndexNode
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import RecursiveRetriever
import os
# from tqdm.notebook import tqdm
import pickle


def build_index_local(docs, chunk_size, out_path: str):
    print("Chunk size: ", chunk_size)

    embed_model = OpenAIEmbedding(model="text-embedding-3-large",
                                  chunk_size=chunk_size,
                                  )
    Settings.embed_model = embed_model

    nodes = []

    splitter = SentenceSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_size/4)
    for idx, doc in enumerate(docs):
        print('Splitting: ' + str(idx))

        cur_nodes = splitter.get_nodes_from_documents([doc])
        for cur_node in cur_nodes:
            # ID will be base + parent
            file_path = doc.metadata["file_path"].split(document_base_path)[1]
            new_node = IndexNode(
                text=cur_node.text or "None",
                index_id=str(file_path),
                metadata=doc.metadata,
                # obj=doc
            )
            nodes.append(new_node)
        

        # Debugging
        print(len(cur_nodes), len(str(doc)), len(str(cur_nodes[0])))
        for xyz in cur_nodes:
            print(xyz)
            print("-")
        print()
        print("----DOC-----")
        print(doc)

        print()
        print()

    print("num nodes: " + str(len(nodes)))

    service_context = ServiceContext.from_defaults(
        llm=llm_evaluate, embed_model=embed_model)

    # save index to disk
    if not os.path.exists(out_path):
        index = VectorStoreIndex(nodes, service_context=service_context)
        index.set_index_id("simple_index")
        index.storage_context.persist(f"./{out_path}")
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./{out_path}"
        )
        # load index
        index = load_index_from_storage(
            storage_context, index_id="simple_index", service_context=service_context
            # storage_context, index_id="simple_index", embed_model=embed_model
        )

    return index


# build_index_local(eval_documents, 1024, "Test")

# vs. Storing embeddings (Weaviate)
`docker-compose -f docker-compose.weaviate-persistent.yml up`

In [184]:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

from llama_index.core.schema import IndexNode
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import RecursiveRetriever
import os
# from tqdm.notebook import tqdm
import pickle

import weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from IPython.display import Markdown, display

# local
client = weaviate.Client("http://localhost:8080")


def get_filepath_substring(file_path):
    return file_path.split(document_base_path)[1]


def build_index(docs, chunk_size):
    print("Chunk size: ", chunk_size)

    embed_model = OpenAIEmbedding(model="text-embedding-3-large",
                                  chunk_size=chunk_size,
                                  )
    Settings.embed_model = embed_model

    service_context = ServiceContext.from_defaults(
        llm=llm_evaluate, embed_model=embed_model)

    index_name = f"NB{chunk_size}"

    # save index to disk if does not exist
    if not client.schema.exists(index_name):
        print("Schema does not exist, rebuilding and then storing in db")
        nodes = []
        splitter = SentenceSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_size/4)
        for idx, doc in enumerate(docs):
            print('Splitting: ' + str(idx))

            cur_nodes = splitter.get_nodes_from_documents([doc])
            for cur_node in cur_nodes:
                # ID will be base + parent
                file_path = get_filepath_substring(doc.metadata["file_path"])
                new_node = IndexNode(
                    text=cur_node.text or "None",
                    index_id=str(file_path),
                    metadata=doc.metadata,
                    # obj=doc
                )
                nodes.append(new_node)

        print("num nodes: " + str(len(nodes)))

        vector_store = WeaviateVectorStore(
            weaviate_client=client, index_name=index_name
        )
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex(nodes, storage_context = storage_context, service_context=service_context)
    else:
        # load index
        print("Schema exists already, load cached from database")
        vector_store = WeaviateVectorStore(
            weaviate_client=client, index_name=index_name
        )
        index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)

    return index


            Please consider upgrading to the latest version. See https://weaviate.io/developers/weaviate/client-libraries/python for details.


In [190]:
# model_lst = [(128,10,True), (128,8,True), (256,6,True), (512,4,True), (1024,2,True), (2048, 1, True)]
model_lst = [(128,8,True), (256,6,True), (512,4,True)]

n_runs = len(model_lst)

answers = [[] for _ in range(n_runs)]
sources = [[] for _ in range(n_runs)]

time_scores = [[] for _ in range(n_runs)]
faithfulness_scores = [[] for _ in range(n_runs)]
relevancy_scores = [[] for _ in range(n_runs)]

faithfulness_false = [[] for _ in range(n_runs)]
relevancy_false = [[] for _ in range(n_runs)]


In [191]:
vector_indices = []

def build_all_indices():
    for chunk_size, k, is_hybrid in model_lst:
        vector_indices.append(build_index(eval_documents, chunk_size))

build_all_indices()

Chunk size:  128
Schema exists already, load cached from database
Chunk size:  256
Schema exists already, load cached from database
Chunk size:  512
Schema exists already, load cached from database


  service_context = ServiceContext.from_defaults(


## **Response Evaluation For A Chunk Size**

We evaluate each chunk_size based on 3 metrics.

1. Average Response Time.
2. Average Faithfulness.
3. Average Relevancy.

Here's a function, `evaluate_response_time_and_accuracy`, that does just that which has:

1. VectorIndex Creation.
2. Building the Query Engine**.**
3. Metrics Calculation.

In [192]:
# Helper methods

def build_query_engine(vector_index, similarity_top_k, is_hybrid):
    print("== BUILDING QUERY ENGINE ==")
    if is_hybrid:
        query_engine = vector_index.as_query_engine(
            similarity_top_k=similarity_top_k, embed_model=Settings.embed_model,
            vector_store_query_mode="hybrid", alpha=0.0  # BM25
        )
        return query_engine
    # -- VEC ONLY --
    query_engine = vector_index.as_query_engine(
        similarity_top_k=similarity_top_k, embed_model=Settings.embed_model,
    )
    return query_engine

In [194]:
# Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size
# We use GPT-3.5-Turbo to generate response and GPT-4 to evaluate it.
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)


def evaluate_response_time_and_accuracy(chunk_size, similarity_top_k, run_i, eval_questions=eval_questions, label="default", is_hybrid=True):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses for a given chunk size.
    """

    vector_index = vector_indices[run_i]

    # Build query engine
    query_engine = build_query_engine(vector_index, similarity_top_k, is_hybrid)


    # Iterate over each question in eval_questions to compute metrics.
    # print("=========== QA pairs")
    for question in eval_questions:
        print("--Q: ", question)
        start_time = time.time()
        response_vector = query_engine.query(question)
        print("--A: ", str(response_vector))
        
        print("--sources: ")
        # get unique list of file paths of source docs
        raw_file_paths = list(set(value['file_path'] for value in response_vector.metadata.values()))
        source_file_paths = list(map(get_filepath_substring, raw_file_paths))
        print(source_file_paths)
        sources[run_i].append(source_file_paths)
        print("----------")
        # print("---documents: ")
        # print(str(response_vector.get_formatted_sources(length=1000)))
        # print("----------")


        answers[run_i].append(response_vector)

        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy_gpt.evaluate_response(
            query=question, response=response_vector
        ).passing

        if not faithfulness_result:
            faithfulness_false[run_i].append((question, str(response_vector)))
        if not relevancy_result:
            relevancy_false[run_i].append((question, str(response_vector)))

        time_scores[run_i].append(elapsed_time)
        faithfulness_scores[run_i].append(faithfulness_result)
        relevancy_scores[run_i].append(relevancy_result)

        print(
            f"t={elapsed_time}, f={faithfulness_result}, r={relevancy_result}\n-------")

    print("===========")

In [67]:

# from llama_index.core.base.response.schema import Response

# Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# def evaluate_response_time_and_accuracy_without_rag(eval_questions=eval_questions):
#     """
#     Evaluate the average response time, faithfulness, and relevancy of responses generated by GPT-3.5-turbo for a given chunk size.

#     Parameters:
#     chunk_size (int): The size of data chunks being processed.

#     Returns:
#     tuple: A tuple containing the average response time, faithfulness, and relevancy metrics.
#     """

#     total_response_time = 0
#     total_faithfulness = 0
#     total_relevancy = 0

#     # By default, similarity_top_k is set to 2. To experiment with different values, pass it as an argument to as_query_engine()
#     num_questions = len(eval_questions)

#     # index = VectorStoreIndex(nodes=[])
#     # query_engine = index.as_query_engine(llm=OpenAI())


#     # Iterate over each question in eval_questions to compute metrics.
#     # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
#     # we're using a loop here to specifically measure response time for different chunk sizes.
#     print("=========== QA pairs")
#     for question in eval_questions:
#         print("--Q\n", question, "\n--")
#         start_time = time.time()

#         response = str(OpenAI().complete(question))
#         response_vector = Response(response)
#         # response_vector = query_engine.query(question)

#         print("--A\n", response, "\n--")
#         elapsed_time = time.time() - start_time

#         faithfulness_result = faithfulness_gpt.evaluate_response(
#             response=response_vector
#         ).passing

#         relevancy_result = relevancy_gpt.evaluate_response(
#             query=question, response=response_vector
#         ).passing

#         total_response_time += elapsed_time
#         total_faithfulness += faithfulness_result
#         total_relevancy += relevancy_result

#         # TODO: both response and retrieval evaluation

#     print("===========")
#     average_response_time = total_response_time / num_questions
#     average_faithfulness = total_faithfulness / num_questions
#     average_relevancy = total_relevancy / num_questions

#     return average_response_time, average_faithfulness, average_relevancy

## **Testing Across Different Chunk Sizes**

We'll evaluate a range of chunk sizes to identify which offers the most promising metrics

In [195]:
from statistics import mean

# Iterate over different chunk sizes to evaluate the metrics to help fix the chunk size.
for run_i, (chunk_size, similarity_top_k, is_hybrid) in enumerate(model_lst):
    evaluate_response_time_and_accuracy(chunk_size, similarity_top_k, run_i, eval_questions=eval_questions, is_hybrid=is_hybrid)

response_model_name = Settings.llm.model
evaluation_model_name = llm_evaluate.model


== BUILDING QUERY ENGINE ==
--Q:  What is the significance of having a database client when working with databases?
--A:  Having a database client is significant when working with databases because it allows the application to establish a connection to the database, execute queries to retrieve or manipulate data, and manage transactions effectively. The database client facilitates communication between the application and the database server, enabling seamless interaction with the database through functions like connecting, querying, and closing connections. This ensures efficient data retrieval, storage, and management within the database system.
--sources: 
['13-working-with-databases-i/5-deno-postgres-querying-a-database.mdx']
----------
t=2.497727394104004, f=True, r=True
-------
--Q:  What is the default database client mentioned in the document for accessing a PostgreSQL database?
--A:  The default database client mentioned in the document for accessing a PostgreSQL database is "

In [196]:
print("============= STATS ============")
print(f"n_questions: {len(eval_questions)}")
print(f"response model: {response_model_name}")
print(f"evaluation model: {evaluation_model_name}")
print("============= MODELS ===========")

for model in range(n_runs):
    time_avg = mean(time_scores[model])
    faithfulness_avg = mean(faithfulness_scores[model])
    relevancy_avg = mean(relevancy_scores[model])
    print(
        f"(hybr-{response_model_name}-{chunk_sizes[model]}*{similarities_top_k[model]}) - avg res time: {time_avg:.2f}s, avg faithfulness: {faithfulness_avg:.2f}, avg relevancy: {relevancy_avg:.2f}")


print("============= QA ============")

for i, question in enumerate(eval_questions):
    print("========================================================")
    print(f"---Q evaluated by {evaluation_model_name}: {question}")
    for model in range(n_runs):
        print("===")
        print(
            f"(hybr-{response_model_name}-{chunk_sizes[model]}*{similarities_top_k[model]}): {answers[model][i]}")
        print(f"t={time_scores[model][i]}s f={faithfulness_scores[model][i]} r={relevancy_scores[model][i]}")
        print(f"sources: {sources[model][i]}")

n_questions: 3
response model: gpt-3.5-turbo
evaluation model: gpt-3.5-turbo
(hybr-gpt-3.5-turbo-128*10) - avg res time: 2.20s, avg faithfulness: 1.00, avg relevancy: 0.67
(hybr-gpt-3.5-turbo-128*8) - avg res time: 4.07s, avg faithfulness: 1.00, avg relevancy: 1.00
(hybr-gpt-3.5-turbo-256*6) - avg res time: 1.55s, avg faithfulness: 1.00, avg relevancy: 0.67
---Q evaluated by gpt-3.5-turbo: What is the significance of having a database client when working with databases?
===
(hybr-gpt-3.5-turbo-128*10): Having a database client is significant when working with databases because it allows the application to establish a connection to the database, execute queries to retrieve or manipulate data, and manage transactions effectively. The database client facilitates communication between the application and the database server, enabling seamless interaction with the database through functions like connecting, querying, and closing connections. This ensures efficient data retrieval, storage, a

# Evaluating responses without RAG

## Warning
From the way that the questions are stated, it might be confusing for the LLM to provide a response to them or impossible to response.
For example:
- "What database management system will be used in this course?"
- "Why does the document strongly recommend using the first option for development when starting to use PostgreSQL?"
 
These have to be cleaned manually before running the evaluation.

In [None]:
# avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy_without_rag(eval_questions=eval_questions)
# print(f"Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")
