# Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex

# **Introduction**

Retrieval-augmented generation (RAG) combines the retrieval capabilities of search systems with the generative abilities of an LLM. When implementing a RAG system, one critical parameter that governs the system's efficiency and performance is the `chunk_size`. How do you determine the optimal chunk size for retrieval? This is where the LlamaIndex `Response Evaluation` module comes in handy. In this blog post, we'll guide you through the steps to determine the best `chunk_size` using that module. If you're unfamiliar with the `Response Evaluation` module, we recommend reviewing its [documentation](https://docs.llamaindex.ai/en/latest/core_modules/supporting_modules/evaluation/modules.html) before proceeding.

## **Why Chunk Size Matters**

Choosing the right `chunk_size` is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

1. **Relevance and Granularity**: A small `chunk_size`, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the `similarity_top_k` setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.
2. **Response Generation Time**: As the `chunk_size` increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

In essence, determining the optimal `chunk_size` is about striking a balance: capturing all essential information without sacrificing speed. It's vital to test a range of sizes thoroughly to find a configuration that suits the specific use case and dataset.
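To make the granularity trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the helper `split_into_chunks` and the 1,000-token document are hypothetical, and it works on a plain list of tokens rather than using the sentence-aware splitter from LlamaIndex.

```python
def split_into_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Hypothetical 1,000-token document; the tokens are just integers here.
tokens = list(range(1000))
for size in (128, 512):
    chunks = split_into_chunks(tokens, chunk_size=size, chunk_overlap=size // 4)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

With a fixed `similarity_top_k`, the smaller chunk size produces many more chunks, so each retrieved chunk covers a smaller share of the document.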

## **Setup**

Before embarking on the experiment, we need to install and import the required modules:

In [11]:
!pip install llama-index llama-index-embeddings-openai spacy


In [12]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator
)
from llama_index.llms.openai import OpenAI

import openai
import time
openai.api_key = ''

## **Load Data**

Let's load our documents.

In [13]:
# Load Data
# reader = SimpleDirectoryReader("../data/web-software-development-1-0/", recursive=True)
reader = SimpleDirectoryReader("../data/web-software-development-1-0/13-working-with-databases-i/", recursive=True)

documents = reader.load_data()
print(len(documents))

14


## **Question Generation**

To select the right `chunk_size`, we'll compute metrics like Average Response time, Faithfulness, and Relevancy for various `chunk_sizes`. The `DatasetGenerator` will help us generate questions from the documents.

In [14]:
# To evaluate each chunk size, we first generate a set of 40 questions from the first 20 documents
# (only 14 are loaded here, so all of them are used).
eval_documents = documents[:20]
print("Amount of documents: ", len(eval_documents))

# TODO do with more chapters, and try to instruct to create 1 question per chunk (or similar)

# Generate the questions from the same subset that is indexed later (eval_documents).
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num=40)
print("=== EVAL QUESTIONS ===")
print(eval_questions)

Amount of documents:  14


=== EVAL QUESTIONS ===
['What is the importance of using a database in web applications?', 'What database management system will be used in this course?', 'What are the learning objectives related to working with databases?', 'Where can you find a tutorial for SQL basics if you need a refresher?', 'How can you start using PostgreSQL according to the document?', 'What is the recommended approach for taking PostgreSQL into use for development?', 'What is the purpose of the Walking skeleton in relation to PostgreSQL?', 'What are some options for running PostgreSQL locally?', 'Name two hosted services that provide PostgreSQL as a service.', 'Why does the document strongly recommend using the first option for development when starting to use PostgreSQL?', 'What are two options for starting to use PostgreSQL as mentioned in the document?', 'What are some examples of hosted services that provide PostgreSQL databases?', 'Why does the document strongly recommend using the first option for devel
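The printed list appears to contain at least one repeated question (the "Why does the document strongly recommend..." entry shows up twice). A minimal sketch of how such repeats could be dropped before evaluation is shown below; `deduplicate_questions` is a hypothetical helper, not part of LlamaIndex, and it only catches exact repeats up to case and whitespace.

```python
def deduplicate_questions(questions):
    """Drop repeated questions, ignoring case and extra whitespace,
    while keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for question in questions:
        key = " ".join(question.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(question)
    return unique


# Small illustrative sample; the third entry repeats the first with a trailing space.
sample = [
    "What are some options for running PostgreSQL locally?",
    "Name two hosted services that provide PostgreSQL as a service.",
    "What are some options for running PostgreSQL locally? ",
]
unique_sample = deduplicate_questions(sample)
print(len(sample), "->", len(unique_sample))
```

In the notebook this could be applied as `eval_questions = deduplicate_questions(eval_questions)`; paraphrased duplicates would still need manual review.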


## Setting Up Evaluators

We use GPT-4 as the backbone for evaluating the responses generated during the experiment. Two evaluators, `FaithfulnessEvaluator` and `RelevancyEvaluator`, are initialised with the `service_context`.

1. **Faithfulness Evaluator** - Checks whether the response from a query engine is supported by its source nodes, which makes it useful for detecting hallucinated responses.
2. **Relevancy Evaluator** - Checks whether the response and its source nodes match the query, which makes it useful for measuring whether the query was actually answered.

In [16]:
# We will use GPT-4 for evaluating the responses
gpt4 = OpenAI(temperature=0, model="gpt-4")

# Define service context for GPT-4 for evaluation
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

# Define Faithfulness and Relevancy Evaluators which are based on GPT-4
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)



## **Storing Embeddings**

In [38]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

from llama_index.core.schema import IndexNode
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import RecursiveRetriever
import os
# from tqdm.notebook import tqdm
import pickle


def build_index(docs, chunk_size, out_path: str):
    print("Chunk size: ", chunk_size)

    embed_model = OpenAIEmbedding(model="text-embedding-3-large",
                                  chunk_size=chunk_size,
                                  )
    Settings.embed_model = embed_model

    nodes = []

    # chunk_overlap is a token count, so use floor division to keep it an integer
    splitter = SentenceSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_size // 4)
    for idx, doc in enumerate(docs):
        print('Splitting: ' + str(idx))

        cur_nodes = splitter.get_nodes_from_documents([doc])
        for cur_node in cur_nodes:
            # ID will be base + parent
            file_path = doc.metadata["file_path"]
            new_node = IndexNode(
                text=cur_node.text or "None",
                index_id=str(file_path),
                metadata=doc.metadata,
                # obj=doc
            )
            nodes.append(new_node)
    print("num nodes: " + str(len(nodes)))

    # save index to disk
    if not os.path.exists(out_path):
        index = VectorStoreIndex(nodes, embed_model=embed_model)
        index.set_index_id("simple_index")
        index.storage_context.persist(f"./{out_path}")
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./{out_path}"
        )
        # load index
        index = load_index_from_storage(
            storage_context, index_id="simple_index", embed_model=embed_model
        )

    return index


In [18]:
chunk_sizes = [128, 256, 512, 1024, 2048]
similarities_top_k = [8, 6, 4, 2, 1]

# for chunk_size in chunk_sizes:
#     llm = OpenAI(model="gpt-3.5-turbo")
#     service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
#     vector_index = VectorStoreIndex.from_documents(
#         eval_documents, service_context=service_context
#     )
    
#     vector_index.storage_context.persist(persist_dir=f"vector_stores/openai_{chunk_size}-wsd")
#     vector_indices.append(vector_index)
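One way to read the pairing of `chunk_sizes` with `similarities_top_k` is as a rough cap on how much retrieved text reaches the LLM per query. The sketch below computes that cap under two assumptions that are not stated in the notebook: every retrieved chunk holds at most `chunk_size` tokens, and prompt-template tokens are ignored. The helper `max_context_tokens` is hypothetical.

```python
chunk_sizes = [128, 256, 512, 1024, 2048]
similarities_top_k = [8, 6, 4, 2, 1]


def max_context_tokens(chunk_size, top_k):
    """Upper bound on retrieved tokens passed to the LLM for one query."""
    return chunk_size * top_k


budgets = {
    chunk_size: max_context_tokens(chunk_size, top_k)
    for chunk_size, top_k in zip(chunk_sizes, similarities_top_k)
}
for chunk_size, budget in budgets.items():
    print(f"chunk_size={chunk_size}: at most {budget} retrieved tokens")
```

Under these assumptions the two smallest chunk sizes send less retrieved text to the LLM than the other three, which is worth keeping in mind when comparing response times across configurations.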

## **Response Evaluation For A Chunk Size**

We evaluate each `chunk_size` based on 3 metrics.

1. Average Response Time.
2. Average Faithfulness.
3. Average Relevancy.

The function `evaluate_response_time_and_accuracy` below computes them in three steps:

1. VectorIndex Creation.
2. Building the Query Engine.
3. Metrics Calculation.
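The metrics calculation boils down to timing each answer and averaging boolean pass/fail verdicts. The following sketch shows that pattern in isolation, with stand-in functions instead of a query engine and GPT-4 evaluators; `average_metrics`, `fake_answer`, and the two judges are hypothetical names used only for illustration.

```python
import time


def average_metrics(questions, answer_fn, judge_fns):
    """Return (average seconds per answer, {judge name: pass rate})."""
    if not questions:
        raise ValueError("questions must not be empty")
    total_time = 0.0
    passes = {name: 0 for name in judge_fns}
    for question in questions:
        # Time only the answer generation, not the judging.
        start = time.perf_counter()
        answer = answer_fn(question)
        total_time += time.perf_counter() - start
        for name, judge in judge_fns.items():
            # Each judge returns a pass/fail verdict; True counts as 1.
            passes[name] += bool(judge(question, answer))
    count = len(questions)
    return total_time / count, {name: n / count for name, n in passes.items()}


# Stand-ins for the query engine and the evaluators.
def fake_answer(question):
    return question.upper()


judges = {
    "non_empty": lambda question, answer: len(answer) > 0,
    "mentions_sql": lambda question, answer: "SQL" in answer,
}
avg_time, rates = average_metrics(
    ["what is sql?", "what is a table?"], fake_answer, judges
)
print(f"{avg_time:.6f}s", rates)
```

Because each verdict is a boolean, the averages are pass rates between 0 and 1, which is how the faithfulness and relevancy figures reported later should be read.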

In [28]:
# Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size
# We use GPT-3.5-Turbo to generate response and GPT-4 to evaluate it.
def evaluate_response_time_and_accuracy(chunk_size, similarity_top_k, eval_questions=eval_questions):
    """
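    Evaluate the average response time, faithfulness, and relevancy of responses generated by GPT-3.5-turbo for a given chunk size.

    Parameters:
    chunk_size (int): The size of data chunks being processed.
    similarity_top_k (int): The number of top-ranked chunks retrieved for each query.
    eval_questions (list[str]): The questions to run through the query engine.

    Returns:
    tuple: A tuple containing the average response time, faithfulness, and relevancy metrics.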
    """

    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # Load vector_index
    vector_index = build_index(eval_documents, chunk_size, f"vector_stores/openai_{chunk_size}-wsd")

    # build query engine
    # By default, similarity_top_k is set to 2. To experiment with different values, pass it as an argument to as_query_engine()
    query_engine = vector_index.as_query_engine(similarity_top_k=similarity_top_k, embed_model=Settings.embed_model)
    num_questions = len(eval_questions)

    # Iterate over each question in eval_questions to compute metrics.
    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
    # we're using a loop here to specifically measure response time for different chunk sizes.
    print("=========== QA pairs")
    for question in eval_questions:
        print("--Q\n", question, "\n--")
        start_time = time.time()
        response_vector = query_engine.query(question)
        # print("--A\n", response_vector, "\n--")
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

        # TODO: both response and retrieval evaluation

    print("===========")
    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy

In [67]:

from llama_index.core.base.response.schema import Response

def evaluate_response_time_and_accuracy_without_rag(eval_questions=eval_questions):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses generated directly by the LLM, without retrieval.

    Parameters:
    eval_questions (list[str]): The questions to send directly to the LLM.

    Returns:
    tuple: A tuple containing the average response time, faithfulness, and relevancy metrics.
    """

    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # No query engine is built here: each question goes straight to the LLM.
    num_questions = len(eval_questions)

    # index = VectorStoreIndex(nodes=[])
    # query_engine = index.as_query_engine(llm=OpenAI())


    # Iterate over each question in eval_questions to compute metrics.
    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
    # we're using a loop here to measure the response time of each question individually.
    print("=========== QA pairs")
    for question in eval_questions:
        print("--Q\n", question, "\n--")
        start_time = time.time()

        response = str(OpenAI().complete(question))
        response_vector = Response(response)
        # response_vector = query_engine.query(question)

        print("--A\n", response, "\n--")
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

        # TODO: both response and retrieval evaluation

    print("===========")
    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy

## **Testing Across Different Chunk Sizes**

We'll evaluate a range of chunk sizes to identify which one offers the most promising metrics.

In [39]:
# Iterate over different chunk sizes to evaluate the metrics to help fix the chunk size.
for chunk_size, similarity_top_k in zip(chunk_sizes, similarities_top_k):
    avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size, similarity_top_k, eval_questions=eval_questions)
    print(f"Chunk size {chunk_size} - Similarities_top_k {similarity_top_k} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")

Chunk size:  128
Splitting: 0
Splitting: 1
Splitting: 2
Splitting: 3
Splitting: 4
Splitting: 5
Splitting: 6
Splitting: 7
Splitting: 8
Splitting: 9
Splitting: 10
Splitting: 11
Splitting: 12
Splitting: 13
num nodes: 1330
Chunk size 128 - Similarities_top_k [8, 6, 4, 2, 1] - Average Response time: 2.23s, Average Faithfulness: 1.00, Average Relevancy: 0.97
Chunk size:  256
Splitting: 0
Splitting: 1
Splitting: 2
Splitting: 3
Splitting: 4
Splitting: 5
Splitting: 6
Splitting: 7
Splitting: 8
Splitting: 9
Splitting: 10
Splitting: 11
Splitting: 12
Splitting: 13
num nodes: 574
Chunk size 256 - Similarities_top_k [8, 6, 4, 2, 1] - Average Response time: 2.10s, Average Faithfulness: 0.97, Average Relevancy: 0.93
Chunk size:  512
Splitting: 0
Splitting: 1
Splitting: 2
Splitting: 3
Splitting: 4
Splitting: 5
Splitting: 6
Splitting: 7
Splitting: 8
Splitting: 9
Splitting: 10
Splitting: 11
Splitting: 12
Splitting: 13
num nodes: 267
Chunk size 512 - Similarities_top_k [8, 6, 4, 2, 1] - Average Response ti
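Once the per-configuration metrics are printed, it can help to collect them in one structure and rank them. The sketch below uses only the two complete result rows visible in the output above (chunk sizes 128 and 256). The ranking rule, mean of faithfulness and relevancy with response time as tie-breaker, is an assumption made for illustration and not something the notebook prescribes.

```python
# Only the rows that are fully visible in the output above.
results = [
    {"chunk_size": 128, "avg_time_s": 2.23, "faithfulness": 1.00, "relevancy": 0.97},
    {"chunk_size": 256, "avg_time_s": 2.10, "faithfulness": 0.97, "relevancy": 0.93},
]


def rank_results(rows):
    """Sort by mean quality (descending), then by response time (ascending)."""
    return sorted(
        rows,
        key=lambda row: (
            -(row["faithfulness"] + row["relevancy"]) / 2,
            row["avg_time_s"],
        ),
    )


best = rank_results(results)[0]
print("Best configuration so far:", best)
```

A different weighting, for example one that penalises slow responses more heavily, could change the ordering, so the rule should be chosen to match the use case.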

# Evaluating responses without RAG

## Warning
Because of the way the generated questions are phrased, the LLM may find some of them confusing or impossible to answer without access to the source documents.
For example:
- "What database management system will be used in this course?"
- "Why does the document strongly recommend using the first option for development when starting to use PostgreSQL?"
 
These have to be cleaned manually before running the evaluation.
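A simple heuristic can narrow down which questions need that manual cleaning. The sketch below flags questions that refer to "this course", "the document", and similar phrases; the pattern and the helper `needs_manual_review` are assumptions made for illustration and will miss other kinds of context-dependent wording.

```python
import re

# Phrases that suggest the question only makes sense with the source text at hand.
CONTEXT_PATTERN = re.compile(
    r"\b(this|the)\s+(document|course|text|chapter)\b", re.IGNORECASE
)


def needs_manual_review(question):
    """Return True if the question appears to depend on the source document."""
    return bool(CONTEXT_PATTERN.search(question))


questions = [
    "What database management system will be used in this course?",
    "Why does the document strongly recommend using the first option for "
    "development when starting to use PostgreSQL?",
    "What is the importance of using a database in web applications?",
]
flagged = [q for q in questions if needs_manual_review(q)]
for q in flagged:
    print("Review:", q)
```

The flagged questions can then be rewritten or removed by hand before calling `evaluate_response_time_and_accuracy_without_rag`.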

In [69]:
avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy_without_rag(eval_questions=eval_questions)
print(f"Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")


--Q
 What is the importance of using a database in web applications? 
--
--A
 Using a database in web applications is important for several reasons:

1. Data storage: Databases provide a structured way to store and organize data, making it easier to retrieve and manipulate information. This is essential for web applications that need to store user information, product details, and other data.

2. Data retrieval: Databases allow for efficient retrieval of data, enabling web applications to quickly access and display information to users. This helps improve the performance and responsiveness of the application.

3. Data consistency: Databases help maintain data consistency by enforcing rules and constraints on the data stored within them. This ensures that the data remains accurate and reliable, even as the application grows and evolves.

4. Data security: Databases provide security features such as user authentication, access control, and encryption to protect sensitive information from