# Selecting the Best Chunking Approach with Evals

So far we've selected chunking parameters pretty arbitrarily. While the relevance was okay, we can be much smarter about it.

## Key Factors in Chunking Strategy

When selecting chunking parameters, we need to consider three critical factors:

- Hit rate: How often we find relevant documents

- Time and cost: Processing efficiency and token usage

- Relevance: Quality of retrieved information



## The Trade-off Challenge

We can easily increase hit rate by retrieving more chunks. But this results in higher time and cost.

We can decrease cost by making chunks smaller. But then we won't retrieve complete information, and our code examples will be incomplete.

We need to balance all these aspects. Larger chunks increase time and cost together, so we can use cost alone as a proxy for efficiency.



## Defining Our Optimization Objective

In the end, we want the method that:

- Has the highest hit rate (H)
- Has the highest relevance (R)
- Has the smallest cost (C)

Let's define our metrics:

- H = average hit rate (0-1)
- R = average relevance (0-1)
- C = average cost in cents

We can combine these into one objective function:

O = f(H, R, C)

We want to maximize H and R while minimizing C.

One option for the function f:

f = H^α × R^β / C^γ

This measures the hit rate and relevance we get per unit of cost. The exponents α, β, and γ control how much weight each factor gets.

Since H and R are already normalized between 0 and 1, we don't need extra scaling.
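
As a minimal sketch, the formula above can be written as a small Python function. The name `objective`, the default exponents, and the sample numbers are assumptions made for illustration, not values from the course code.

```python
def objective(hit_rate, relevance, cost, alpha=1.0, beta=1.0, gamma=1.0):
    """Compute f = H^alpha * R^beta / C^gamma.

    hit_rate and relevance are expected in [0, 1]; cost must be positive.
    The default exponents are placeholders chosen only for illustration.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (hit_rate ** alpha) * (relevance ** beta) / (cost ** gamma)


# Hypothetical numbers: two configurations with the same quality
# but different cost. The cheaper one gets the higher score.
cheap = objective(hit_rate=0.8, relevance=0.75, cost=2.0)
pricey = objective(hit_rate=0.8, relevance=0.75, cost=4.0)
print(cheap > pricey)
```

Raising an exponent above 1 makes the score more sensitive to that factor, while an exponent below 1 dampens it.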

## Systematic Testing Approach

The simplest strategy is an exhaustive grid search over all parameter combinations:

In [None]:
sizes = [1000, 2000, 3000, 5000]
steps = [1000, 2000, 3000]
top_ks = [5, 10, 15]

results = []

for size in sizes:
    for step in steps:
        for top_k in top_ks:
            # evaluate_agent stands for a full agent evaluation run (not defined here)
            results.append(evaluate_agent(size, step, top_k))


Problem: with this many options, running a full agent evaluation for every combination would take too long.
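
To make the size of this search space concrete, the sketch below counts the combinations with `itertools.product`. The `step <= size` filter mirrors the check used in the parameter sweep later in this section; the variable names `full_grid` and `valid_grid` are illustrative.

```python
from itertools import product

sizes = [1000, 2000, 3000, 5000]
steps = [1000, 2000, 3000]
top_ks = [5, 10, 15]

# Every (size, step, top_k) triple in the full grid.
full_grid = list(product(sizes, steps, top_ks))

# Drop combinations where the step is larger than the chunk size.
valid_grid = [
    (size, step, top_k)
    for size, step, top_k in full_grid
    if step <= size
]

print(len(full_grid), len(valid_grid))  # prints: 36 27
```

Even after filtering, that is 27 full agent runs, which is why a cheaper first pass is useful.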

## Efficient Testing Strategy

How we can test efficiently:

- Testing hit rate (H) is fast and relatively inexpensive
- While testing H, we can estimate cost (C) by counting returned tokens
- Let's collect 25-50 different combinations of size, step, and top_k
- Compute the H and token count parts of our objective function
- Select 6-8 top candidates
- Run full agent evaluation on those candidates to get relevance (R)
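
The first stage of this funnel can be sketched as follows. The helper names `cheap_score` and `shortlist` are hypothetical, and the measurements are invented for illustration. The exponents 2 and 0.5 match the ones used in the analysis later in this section, with `gamma` naming the cost exponent as in the objective function above.

```python
def cheap_score(hit_rate, num_tokens, alpha=2.0, gamma=0.5):
    """Partial objective using only H and the token count as a cost proxy."""
    return (hit_rate ** alpha) / ((num_tokens / 1000) ** gamma)


def shortlist(measurements, n):
    """Return the n configurations with the highest cheap score."""
    ranked = sorted(
        measurements,
        key=lambda m: cheap_score(m["hit_rate"], m["num_tokens"]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical stage-one measurements, invented for illustration only.
measurements = [
    {"size": 1000, "step": 1000, "top_k": 5, "hit_rate": 0.60, "num_tokens": 1500},
    {"size": 2000, "step": 1000, "top_k": 5, "hit_rate": 0.70, "num_tokens": 3000},
    {"size": 3000, "step": 3000, "top_k": 15, "hit_rate": 0.80, "num_tokens": 12000},
]

# Only the shortlisted configurations go on to the expensive agent evaluation.
candidates = shortlist(measurements, n=2)
for candidate in candidates:
    print(candidate["size"], candidate["step"], candidate["top_k"])
```

In this made-up data the configuration with the highest hit rate is dropped, because its token count is four to eight times higher than the others.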



### Implementation

Set up our evaluation framework:

In [None]:
import json
from minsearch import Index
import docs

from tqdm.auto import tqdm

Load the ground truth data:

In [None]:
import pandas as pd

df_ground_truth = pd.read_csv('../evals/ground_truth_evidently.csv')
ground_truth = df_ground_truth.to_dict(orient='records')

Load the documents:

In [None]:
github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)

Calculate the number of tokens (run `uv add tiktoken` to install [tiktoken](https://github.com/openai/tiktoken)):

In [None]:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o-mini")

def calculate_num_tokens(search_results):
    json_result = json.dumps(search_results)
    num_tokens = len(encoding.encode(json_result))
    return num_tokens


We'll use the functions for calculating hit_rate and mrr from before:

In [None]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)


def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
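
To see what the two metrics measure, here is a small worked example on a hand-made relevance matrix. The functions are restated in compact form so the snippet runs on its own, and the three rows are invented for illustration.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result.
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 when there is none).
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


relevance_total = [
    [False, True, False],   # first hit at rank 2, contributes 1/2
    [False, False, False],  # no hit, contributes 0
    [True, False, False],   # first hit at rank 1, contributes 1
]

print(hit_rate(relevance_total))  # 2 of 3 queries have a hit
print(mrr(relevance_total))       # (1/2 + 0 + 1) / 3 = 0.5
```

Hit rate only asks whether the right document appears at all, while MRR also rewards it for appearing near the top.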


Put everything together:

In [None]:
def evaluate(
        ground_truth,
        search_function,
        question_column='question',
        id_column='filename'
):
    relevance_total = []
    tokens = []

    for q in ground_truth:
        doc_id = q[id_column]
        results = search_function(q[question_column])
        num_tokens = calculate_num_tokens(results)
        tokens.append(num_tokens)
        relevance = [d[id_column] == doc_id for d in results]
        relevance_total.append(relevance)

    avg_tokens = sum(tokens) / len(tokens)
    
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
        'num_tokens': avg_tokens
    }


Create the chunking evaluation function:

In [None]:
def evaluate_chunks(size, step, top_k):
    chunks = docs.chunk_documents(parsed_data, size=size, step=step)
    
    index = Index(
        text_fields=["content", "filename", "title", "description"],
    )
    
    index.fit(chunks)
    
    def search(query: str):
        return index.search(
            query=query,
            num_results=top_k,
        )

    return evaluate(ground_truth, search)


## Running Systematic Tests

Execute our parameter sweep:

In [None]:
sizes = [1000, 2000, 3000, 5000]
steps = [1000, 2000, 3000]
top_ks = [5, 10, 15]

results = []

for size in sizes:
    for step in steps:
        if step > size:
            # a step larger than the chunk size would skip text between chunks
            continue

        for top_k in top_ks:
            print(f"{size=}, {step=}, {top_k=}")
            result = evaluate_chunks(size, step, top_k)
            print(result)
            results.append((size, step, top_k, result['hit_rate'], result['num_tokens']))
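
The sweep skips combinations with `step > size`, presumably because consecutive windows would then leave gaps and some text would never be indexed. The sketch below illustrates this with a hypothetical `sliding_window` helper. It is an assumption about how a size/step chunker typically behaves, not the actual implementation of `docs.chunk_documents`.

```python
def sliding_window(text, size, step):
    """Split text into chunks of `size` characters, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [text[start:start + size] for start in range(0, len(text), step)]


text = "abcdefghij"

# step < size: neighbouring chunks overlap, nothing is lost.
print(sliding_window(text, size=4, step=2))

# step == size: chunks are adjacent with no overlap.
print(sliding_window(text, size=5, step=5))

# step > size: the characters "d", "e", "i", "j" fall between windows and are dropped.
print(sliding_window(text, size=3, step=5))
```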


## Analyzing Results

Process and score the results:

In [None]:
import pandas as pd

df = pd.DataFrame(results, columns=['size', 'step', 'top_k', 'hit_rate', 'num_tokens'])

alpha = 2     # exponent on hit rate
beta = 0.5    # exponent on token cost (plays the role of γ in the objective function)
df['score'] = (df.hit_rate ** alpha) / ((df.num_tokens / 1000) ** beta)

df = df.sort_values(by='score', ascending=False)


This is our scoring function:

```text
            hit_rate ** alpha
score = ------------------------
        (num_tokens/1000) ** beta
```

This represents "retrieval quality adjusted for the cost of processing tokens".

The numerator, `hit_rate ** alpha`, rewards retrieval quality non-linearly. With alpha = 2, doubling the hit rate multiplies the score by 4, so the metric is much more sensitive to improvements in hit rate than to cost savings.

In other words, we really value configurations that retrieve more correct chunks.

The denominator, `(num_tokens / 1000) ** beta`, penalizes cost, but more softly than linearly. With beta = 0.5, doubling the token count divides the score by only about 1.41 (a drop of roughly 29%).

This means we still care about efficiency, but we don't let it dominate quality.

You can play with the parameters α and β to see which configurations rise to the top and adjust accordingly.
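
The sensitivity to each factor can be checked numerically. This is a small sketch using the scoring function from this section; the baseline values are hypothetical and chosen only to make the ratios easy to read.

```python
def score(hit_rate, num_tokens, alpha=2, beta=0.5):
    """Scoring function from this section: hit_rate^alpha / (num_tokens/1000)^beta."""
    return (hit_rate ** alpha) / ((num_tokens / 1000) ** beta)


# Hypothetical baseline configuration.
base = score(hit_rate=0.4, num_tokens=2000)

# Doubling the hit rate multiplies the score by 2 ** alpha, which is 4 here.
gain_from_hit_rate = score(hit_rate=0.8, num_tokens=2000) / base
print(gain_from_hit_rate)

# Doubling the token count divides the score by 2 ** beta, about 1.41 here.
loss_from_tokens = base / score(hit_rate=0.4, num_tokens=4000)
print(loss_from_tokens)
```

Passing different `alpha` and `beta` values to `score` shows how quickly the balance between quality and cost shifts.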

Look at the top performing configurations:


In [None]:
df.head()

## Testing Top Candidates with Full Agent Evaluation

Run full evaluations to get relevance scores:

In [None]:
params = df[:5].to_dict(orient='records')

Next, we need to adjust the code to make it possible to configure our agent.

First, create an AgentConfig class:

In [None]:
from dataclasses import dataclass


@dataclass
class AgentConfig:
    model_name: str = "openai:gpt-4o-mini"

    chunk_size: int = 2000
    chunk_step: int = 1000
    top_k: int = 5


Next, update all the method signatures to accept this config:

- run_full_evaluation - add an agent parameter
- prepare_search_tools - add chunk size, step, and top_k parameters
- SearchTools - add a top_k parameter
- [See full diff here](https://github.com/alexeygrigorev/ai-bootcamp-codespace/commit/e88074ed8a999d00be5a92add3bd0486920b31f0)

Now let's run the agent:

In [None]:
import search_agent

param_set = params[0]

config = search_agent.AgentConfig(
    chunk_size=param_set['size'],
    chunk_step=param_set['step'],
    top_k=param_set['top_k']
)

agent = search_agent.create_agent(config)


Run the complete evaluation pipeline:

In [None]:
from evals.eval_orchestrator import run_full_evaluation

await run_full_evaluation(agent, csv_path='../evals/gt-sample.csv')

Results:

```text
Evaluation Metrics:

  ✓ CheckName.tool_call_search    92.3%
  ✓ CheckName.instructions_follow 100.0%
  ✓ CheckName.instructions_avoid  100.0%
  ⚠ CheckName.answer_relevant     75.0%
  ⚠ CheckName.answer_clear        71.4%
  ⚠ CheckName.answer_match        71.4%
  ⚠ CheckName.answer_citations    71.4%
  ⚠ CheckName.completeness        71.4%
```

Now you can do this for all the top parameter sets and select the one with the highest relevance (adjusted by cost).

## Final Result

In the end, relevance (R) matters more than hit rate (H) for the end-user experience. So we can simplify our final scoring to:

Score = R^α / C^β

Use this score to select the best approach for your specific use case.

## Summary

Now we can use a data-driven approach for selecting the best chunking parameters. In our case, we can find the optimal balance between quality and cost for our specific application.

The framework is reusable - whenever you change your data, model, or requirements, you can re-run this analysis to find the new optimal parameters.