# ChunkFlow: Advanced Metrics Deep Dive

This notebook provides a comprehensive exploration of all 12 evaluation metrics in ChunkFlow.

## Metric Categories

ChunkFlow provides 3 categories of metrics:

1. **Semantic Metrics (4)**: Coherence, Boundary Quality, Chunk Stickiness, Topic Diversity
2. **Retrieval Metrics (4)**: NDCG@k, Recall@k, Precision@k, MRR
3. **RAG Quality Metrics (4)**: Context Relevance, Answer Faithfulness, Context Precision, Context Recall

## What You'll Learn

- How each metric works
- When to use each metric
- How to interpret metric scores
- How to create ground truth data for retrieval metrics

## Prerequisites

```bash
pip install "chunk-flow[huggingface]"
```

In [None]:
# Import required libraries
import asyncio
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, MetricRegistry

print("✓ All imports successful!")

## Setup: Document and Chunks

In [None]:
# Sample document about climate change
document = """
# Climate Change and Its Impact

Climate change refers to long-term shifts in global temperatures and weather patterns. 
While climate change is a natural phenomenon, scientific evidence shows that human 
activities have been the primary driver since the mid-20th century.

## Causes of Climate Change

The main cause is the greenhouse effect. Greenhouse gases like carbon dioxide (CO2), 
methane (CH4), and nitrous oxide (N2O) trap heat in the atmosphere. Human activities 
such as burning fossil fuels, deforestation, and industrial processes have dramatically 
increased these gases' concentrations.

Burning coal, oil, and gas for electricity and heat is the largest contributor, 
producing about 35% of global greenhouse gas emissions. Transportation accounts for 
another 25%, while manufacturing and construction contribute about 20%.

## Environmental Impacts

Rising temperatures cause glaciers and ice sheets to melt, leading to rising sea levels. 
The global mean sea level has risen about 8-9 inches since 1880. Ocean warming and 
acidification threaten marine ecosystems and coral reefs.

Extreme weather events are becoming more frequent and severe. These include hurricanes, 
droughts, floods, and heat waves. Such events cause billions in damage and displace 
millions of people annually.

## Biodiversity Loss

Climate change threatens countless species with extinction. Animals and plants that 
cannot adapt quickly enough to changing conditions face population decline or 
extinction. Polar bears, for example, depend on sea ice for hunting, which is 
rapidly disappearing.

## Human Health Effects

Climate change affects human health in multiple ways. Heat waves cause thousands of 
deaths annually. Changing disease patterns mean mosquito-borne illnesses like malaria 
and dengue fever are spreading to new regions.

Air quality deteriorates due to increased wildfires and smog. Crop failures and water 
scarcity threaten food security for millions, particularly in developing nations.

## Solutions and Mitigation

Addressing climate change requires global cooperation. The Paris Agreement aims to 
limit global warming to well below 2°C above pre-industrial levels. Achieving this 
requires drastic reductions in greenhouse gas emissions.

Renewable energy sources like solar, wind, and hydroelectric power must replace 
fossil fuels. Energy efficiency improvements in buildings, transportation, and 
industry are crucial. Carbon capture and storage technologies show promise.

Individual actions matter too. Reducing energy consumption, using public 
transportation, eating less meat, and supporting sustainable products all help.
"""

# Create chunker and chunk the document
chunker = StrategyRegistry.create(
    "recursive",
    {"chunk_size": 400, "overlap": 60}
)

chunk_result = await chunker.chunk(document, doc_id="climate_change")
chunks = chunk_result.chunks

print(f"Created {len(chunks)} chunks")
print(f"\nFirst 3 chunks:")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(f"{chunk[:150]}...")

In [None]:
# Generate embeddings
embedder = EmbeddingProviderFactory.create(
    "huggingface",
    {"model": "sentence-transformers/all-MiniLM-L6-v2", "normalize": True}
)

emb_result = await embedder.embed_texts(chunks)
embeddings = emb_result.embeddings

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimensions: {emb_result.dimensions}")

## Category 1: Semantic Metrics

Semantic metrics evaluate chunking quality without requiring ground truth data.

### 1.1 Semantic Coherence

**What it measures**: How semantically similar the content within each chunk is.

**How it works**: Computes average pairwise similarity between sentences/segments within each chunk.

**Interpretation**: 
- Higher scores (closer to 1.0) = chunks contain related, cohesive content
- Lower scores = chunks mix unrelated topics

**When to use**: Always! This is a fundamental quality metric.
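
The computation described above can be sketched for a single chunk as follows. This is only an illustration: the function name `mean_pairwise_similarity` and the toy 2-D vectors are assumptions, and ChunkFlow's own implementation may segment chunks and aggregate scores differently.

```python
import numpy as np

def mean_pairwise_similarity(segment_embeddings):
    """Average cosine similarity over all distinct pairs of segments (hypothetical helper)."""
    vectors = np.asarray(segment_embeddings, dtype=float)
    if len(vectors) < 2:
        return 1.0  # assumption: a single segment counts as fully coherent
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Use only the upper triangle (i < j) so each pair is counted once.
    upper = np.triu_indices(len(vectors), k=1)
    return float(sims[upper].mean())

# Toy data: three parallel segment vectors vs. two orthogonal ones.
cohesive = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
mixed = [[1.0, 0.0], [0.0, 1.0]]

print(f"cohesive chunk: {mean_pairwise_similarity(cohesive):.2f}")
print(f"mixed chunk:    {mean_pairwise_similarity(mixed):.2f}")
```

Parallel vectors give a score near 1.0 and orthogonal vectors a score near 0.0, which matches the interpretation above.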

In [None]:
# Evaluate semantic coherence
pipeline = EvaluationPipeline(metrics=["semantic_coherence"])

coherence_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
)

coherence_score = coherence_result["semantic_coherence"].score
print(f"Semantic Coherence: {coherence_score:.4f}")
print(f"\nInterpretation:")
if coherence_score > 0.8:
    print("  ✓ Excellent - Chunks are highly coherent")
elif coherence_score > 0.6:
    print("  ✓ Good - Chunks maintain reasonable coherence")
elif coherence_score > 0.4:
    print("  ⚠ Fair - Some chunks may mix topics")
else:
    print("  ✗ Poor - Chunks lack coherence")

### 1.2 Boundary Quality

**What it measures**: How well chunk boundaries separate distinct topics.

**How it works**: Compares similarity between adjacent chunks vs. non-adjacent chunks.

**Interpretation**:
- Higher scores = chunks have distinct, well-separated topics
- Lower scores = adjacent chunks are too similar (poor boundaries)

**When to use**: When evaluating if boundaries occur at natural topic shifts.
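
The two quantities being compared can be computed as in the sketch below. The helper `adjacent_vs_nonadjacent` is hypothetical, and how ChunkFlow combines the two averages into a single score is not stated here, so the sketch stops at the ingredients.

```python
import numpy as np

def adjacent_vs_nonadjacent(chunk_embeddings):
    """Return (mean adjacent similarity, mean non-adjacent similarity).

    Hypothetical helper for illustration; not part of the ChunkFlow API.
    """
    vectors = np.asarray(chunk_embeddings, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # All pairs (i, j) with i < j; a pair is adjacent when j == i + 1.
    rows, cols = np.triu_indices(len(vectors), k=1)
    is_adjacent = (cols - rows) == 1
    adjacent = sims[rows[is_adjacent], cols[is_adjacent]]
    distant = sims[rows[~is_adjacent], cols[~is_adjacent]]
    mean_adjacent = float(adjacent.mean()) if adjacent.size else float("nan")
    mean_distant = float(distant.mean()) if distant.size else float("nan")
    return mean_adjacent, mean_distant

# Toy data: chunks 0-1 share one topic, chunks 2-3 share another.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
adjacent_mean, distant_mean = adjacent_vs_nonadjacent(toy_embeddings)
print(f"adjacent: {adjacent_mean:.3f}, non-adjacent: {distant_mean:.3f}")
```

In the toy data only the boundary between chunks 1 and 2 marks a real topic shift, so two of the three adjacent pairs remain highly similar.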

In [None]:
# Evaluate boundary quality
pipeline = EvaluationPipeline(metrics=["boundary_quality"])

boundary_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
)

boundary_score = boundary_result["boundary_quality"].score
print(f"Boundary Quality: {boundary_score:.4f}")
print(f"\nInterpretation:")
if boundary_score > 0.7:
    print("  ✓ Excellent - Boundaries align with topic shifts")
elif boundary_score > 0.5:
    print("  ✓ Good - Boundaries are reasonably placed")
elif boundary_score > 0.3:
    print("  ⚠ Fair - Some boundaries split related content")
else:
    print("  ✗ Poor - Boundaries are poorly placed")

### 1.3 Chunk Stickiness (MoC)

**What it measures**: How much topics "bleed" across chunk boundaries.

**How it works**: Based on the Chunk Stickiness metric from the MoC (Mixture-of-Chunkers) paper by Zhao et al. (2025); measures topic overlap between adjacent chunks.

**Interpretation**:
- **Lower scores are better** (inverted metric)
- Lower scores = cleaner boundaries, less topic bleeding
- Higher scores = topics spread across multiple chunks

**When to use**: When you want to ensure chunks are self-contained.

In [None]:
# Evaluate chunk stickiness
pipeline = EvaluationPipeline(metrics=["chunk_stickiness"])

stickiness_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
)

stickiness_score = stickiness_result["chunk_stickiness"].score
print(f"Chunk Stickiness: {stickiness_score:.4f} (lower is better)")
print(f"\nInterpretation:")
if stickiness_score < 0.3:
    print("  ✓ Excellent - Minimal topic bleeding")
elif stickiness_score < 0.5:
    print("  ✓ Good - Acceptable topic containment")
elif stickiness_score < 0.7:
    print("  ⚠ Fair - Some topic bleeding across boundaries")
else:
    print("  ✗ Poor - Significant topic bleeding")

### 1.4 Topic Diversity

**What it measures**: How diverse topics are across all chunks.

**How it works**: Measures average dissimilarity between all chunk pairs.

**Interpretation**:
- Higher scores = chunks cover diverse topics (good for broad documents)
- Lower scores = chunks cover similar topics (expected for focused documents)

**When to use**: When evaluating coverage of diverse topics in a document.

In [None]:
# Evaluate topic diversity
pipeline = EvaluationPipeline(metrics=["topic_diversity"])

diversity_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
)

diversity_score = diversity_result["topic_diversity"].score
print(f"Topic Diversity: {diversity_score:.4f}")
print(f"\nInterpretation:")
if diversity_score > 0.7:
    print("  ✓ High diversity - Chunks cover many different topics")
elif diversity_score > 0.5:
    print("  ✓ Moderate diversity - Good topic coverage")
elif diversity_score > 0.3:
    print("  ⚠ Low diversity - Chunks are similar (may be expected)")
else:
    print("  ⚠ Very low diversity - Document focuses on single topic")

## Category 2: Retrieval Metrics

Retrieval metrics evaluate how well chunks perform in a search/retrieval context. **They require ground truth data.**

### Creating Ground Truth

For demonstration, we'll create a sample query and identify which chunks are relevant.

In [None]:
# Create a sample query
query = "What are the environmental impacts of climate change on oceans and wildlife?"

# Embed the query
query_emb_result = await embedder.embed_texts([query])
query_embedding = query_emb_result.embeddings[0]

# Compute similarities to find relevant chunks
query_emb_array = np.array(query_embedding).reshape(1, -1)
embeddings_array = np.array(embeddings)
similarities = cosine_similarity(query_emb_array, embeddings_array)[0]

# Show top chunks
print(f"Query: '{query}'\n")
top_indices = np.argsort(similarities)[::-1][:5]
print("Top 5 most relevant chunks:\n")
for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. Chunk {idx} (similarity: {similarities[idx]:.4f})")
    print(f"   {chunks[idx][:100]}...\n")

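# Define ground truth (chunks about environmental impacts).
# NOTE: thresholding the same similarities that drive the ranking is circular,
# so the retrieval scores below will look optimistic. Prefer hand-labelled
# indices for real evaluations.
relevant_indices = [i for i, sim in enumerate(similarities) if sim > 0.5]
if not relevant_indices:
    # Fall back to the single best match so the metrics below stay defined
    relevant_indices = [int(top_indices[0])]
print(f"\nGround truth: Chunks {relevant_indices} are considered relevant")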
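
Because the cell above derives its ground truth from the same similarity scores that later rank the chunks, an independent labelling pass is usually preferable. The sketch below uses a crude keyword match as a stand-in for human labels. The helper `label_relevant_chunks`, its `min_hits` parameter, and the toy chunks are all assumptions for illustration.

```python
def label_relevant_chunks(chunks, keywords, min_hits=1):
    """Return indices of chunks mentioning at least `min_hits` keywords.

    Hypothetical helper: a rough substitute for manual relevance labels.
    """
    lowered_keywords = [kw.lower() for kw in keywords]
    relevant = []
    for index, text in enumerate(chunks):
        lowered = text.lower()
        hits = sum(1 for kw in lowered_keywords if kw in lowered)
        if hits >= min_hits:
            relevant.append(index)
    return relevant

# Toy chunks paraphrasing the sample document.
toy_chunks = [
    "Greenhouse gases trap heat in the atmosphere.",
    "Ocean warming and acidification threaten coral reefs.",
    "Polar bears depend on sea ice for hunting.",
    "Renewable energy must replace fossil fuels.",
]
toy_relevant = label_relevant_chunks(
    toy_chunks, ["ocean", "coral", "sea ice", "species"]
)
print(toy_relevant)  # [1, 2]
```

The resulting list can take the place of `relevant_indices` in the `ground_truth` dictionary used by the retrieval metrics.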

### 2.1 NDCG@k (Normalized Discounted Cumulative Gain)

**What it measures**: Ranking quality with position-based discounting.

**How it works**: Rewards relevant chunks appearing higher in results, with diminishing returns for lower positions.

**Interpretation**:
- 1.0 = Perfect ranking
- >0.8 = Excellent ranking
- 0.6 to 0.8 = Good ranking
- 0.4 to 0.6 = Fair ranking
- ≤0.4 = Poor ranking

**When to use**: When ranking order matters (most retrieval scenarios).
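
For binary relevance labels, the standard NDCG@k formula can be sketched as below. The function `ndcg_at_k` is a standalone illustration of the textbook definition, not ChunkFlow's implementation, which may treat graded relevance or edge cases differently.

```python
import math

def ndcg_at_k(ranked_indices, relevant_indices, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_indices)
    # DCG: a relevant hit at 1-based rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, idx in enumerate(ranked_indices[:k], start=1)
        if idx in relevant
    )
    # Ideal DCG: all relevant items packed into the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    # Assumption: return 0.0 when there are no relevant items.
    return dcg / idcg if idcg > 0 else 0.0

ranking = [3, 0, 2, 1, 4]  # chunk indices, best match first
print(f"relevant at ranks 1 and 3: {ndcg_at_k(ranking, [3, 2], k=5):.4f}")
print(f"relevant at ranks 2 and 4: {ndcg_at_k(ranking, [0, 1], k=5):.4f}")
```

Both rankings contain the same number of relevant chunks in the top 5, but the first scores higher because its hits sit nearer the top.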

In [None]:
# Evaluate NDCG@k
pipeline = EvaluationPipeline(metrics=["ndcg_at_k"])

# Create ground truth
ground_truth = {
    "query_embedding": query_embedding,
    "relevant_indices": relevant_indices,
    "k": 5,
}

ndcg_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

ndcg_score = ndcg_result["ndcg_at_k"].score
print(f"NDCG@5: {ndcg_score:.4f}")
print(f"\nInterpretation:")
if ndcg_score > 0.8:
    print("  ✓ Excellent - Relevant chunks ranked highly")
elif ndcg_score > 0.6:
    print("  ✓ Good - Decent ranking of relevant chunks")
elif ndcg_score > 0.4:
    print("  ⚠ Fair - Some relevant chunks ranked too low")
else:
    print("  ✗ Poor - Ranking needs improvement")

### 2.2 Recall@k

**What it measures**: What fraction of the relevant chunks appear in the top k results.

**How it works**: Recall@k = (# relevant chunks in top k) / (total # relevant chunks)

**Interpretation**:
- 1.0 = All relevant chunks retrieved
- 0.5 = Half of relevant chunks retrieved

**When to use**: When completeness matters (finding all relevant info).
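
The formula above translates directly into a few lines of Python. The function below is an illustrative sketch; returning 0.0 when no chunk is labelled relevant is an assumption, since the ratio is otherwise undefined.

```python
def recall_at_k(ranked_indices, relevant_indices, k):
    """Fraction of all relevant chunks that appear in the top k results."""
    relevant = set(relevant_indices)
    if not relevant:
        return 0.0  # assumption: treat "nothing relevant" as zero recall
    hits = sum(1 for idx in ranked_indices[:k] if idx in relevant)
    return hits / len(relevant)

ranking = [3, 0, 2, 1, 4]  # chunk indices, best match first
relevant = [2, 3, 4]

# Top 3 contains chunks 3 and 2 but misses chunk 4.
print(f"Recall@3: {recall_at_k(ranking, relevant, k=3):.4f}")
print(f"Recall@5: {recall_at_k(ranking, relevant, k=5):.4f}")
```

Recall can only stay the same or grow as k increases, which is why it is usually reported alongside Precision@k.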

In [None]:
# Evaluate Recall@k
pipeline = EvaluationPipeline(metrics=["recall_at_k"])

recall_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

recall_score = recall_result["recall_at_k"].score
print(f"Recall@5: {recall_score:.4f}")
print(f"\nThis means {recall_score*100:.1f}% of relevant chunks were retrieved in top 5 results")

### 2.3 Precision@k

**What it measures**: What fraction of the top k results are actually relevant.

**How it works**: Precision@k = (# relevant chunks in top k) / k

**Interpretation**:
- 1.0 = All top k results are relevant
- 0.5 = Half of top k results are relevant

**When to use**: When accuracy of top results matters.
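
A minimal sketch of the formula follows; `precision_at_k` here is a standalone illustration rather than ChunkFlow's own code. One consequence worth seeing in numbers is that the score cannot exceed (number of relevant chunks) / k.

```python
def precision_at_k(ranked_indices, relevant_indices, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_indices)
    hits = sum(1 for idx in ranked_indices[:k] if idx in relevant)
    # Divide by k (not by the number of results returned), per the formula above.
    return hits / k

ranking = [3, 0, 2, 1, 4]  # chunk indices, best match first
relevant = [2, 3]

print(f"Precision@3: {precision_at_k(ranking, relevant, k=3):.4f}")
# With only 2 relevant chunks, Precision@5 can never exceed 2 / 5.
print(f"Precision@5: {precision_at_k(ranking, relevant, k=5):.4f}")
```

When the ground truth contains fewer than k relevant chunks, a low Precision@k may reflect that ceiling rather than a poor ranking.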

In [None]:
# Evaluate Precision@k
pipeline = EvaluationPipeline(metrics=["precision_at_k"])

precision_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

precision_score = precision_result["precision_at_k"].score
print(f"Precision@5: {precision_score:.4f}")
print(f"\nThis means {precision_score*100:.1f}% of top 5 results are relevant")

### 2.4 MRR (Mean Reciprocal Rank)

**What it measures**: How highly the first relevant chunk is ranked.

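**How it works**: For a single query, the reciprocal rank is 1 / (rank of first relevant chunk). MRR is the mean of this value over all queries; with one query, as in this notebook, the two coincide.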

**Interpretation**:
- 1.0 = The first result is relevant
- 0.5 = The first relevant chunk is the second result
- 0.33 = The first relevant chunk is the third result

**When to use**: When you care about "single answer" scenarios.
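
The "mean" in MRR only matters once there are several queries. The sketch below averages reciprocal ranks over three toy queries; the helper names and the toy rankings are assumptions for illustration.

```python
def reciprocal_rank(ranked_indices, relevant_indices):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant_indices)
    for rank, idx in enumerate(ranked_indices, start=1):
        if idx in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average the reciprocal rank over all queries."""
    scores = [
        reciprocal_rank(ranking, relevant)
        for ranking, relevant in zip(rankings, relevant_sets)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Three toy queries: first relevant hit at rank 1, rank 2, and rank 3.
rankings = [[3, 0, 2], [1, 2, 0], [0, 1, 2]]
relevant_sets = [[3], [2], [2]]

print(f"MRR: {mean_reciprocal_rank(rankings, relevant_sets):.4f}")
```

The three reciprocal ranks are 1, 1/2, and 1/3, so the mean is 11/18, or about 0.61.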

In [None]:
# Evaluate MRR
pipeline = EvaluationPipeline(metrics=["mrr"])

mrr_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

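mrr_score = mrr_result["mrr"].score
print(f"MRR: {mrr_score:.4f}")
if mrr_score > 0:
    # round() guards against float truncation (e.g. 2.9999... -> 2 with int())
    first_relevant_rank = round(1 / mrr_score)
    print(f"\nThis means the first relevant chunk appears at rank {first_relevant_rank}")
else:
    print("\nNo relevant chunk was retrieved")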

## Category 3: RAG Quality Metrics (RAGAS-inspired)

These metrics evaluate chunking quality in the context of RAG (Retrieval-Augmented Generation) systems.

### 3.1 Context Relevance

**What it measures**: How relevant retrieved chunks are to a query.

**How it works**: Average similarity between query and top-k retrieved chunks.

**Interpretation**:
- Higher scores = chunks are highly relevant to query
- Lower scores = retrieved chunks may not answer the query

**When to use**: Always in RAG systems - core quality metric.
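
As a rough sketch of the description above, the function below averages the cosine similarity between a query and its k most similar chunks. The name `context_relevance` mirrors the metric name, but this standalone version and its toy vectors are assumptions, not the library's implementation.

```python
import numpy as np

def context_relevance(query_embedding, chunk_embeddings, k):
    """Mean cosine similarity between the query and its top-k chunks (illustrative)."""
    query = np.asarray(query_embedding, dtype=float)
    chunk_matrix = np.asarray(chunk_embeddings, dtype=float)
    # Normalize so that dot products are cosine similarities.
    query = query / np.linalg.norm(query)
    chunk_matrix = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    sims = chunk_matrix @ query
    # Sort descending and keep the k highest similarities.
    top_k = np.sort(sims)[::-1][:k]
    return float(top_k.mean())

toy_query = [1.0, 0.0]
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # similarities: 1.0, 0.0, ~0.707

print(f"k=1: {context_relevance(toy_query, toy_chunks, k=1):.4f}")
print(f"k=2: {context_relevance(toy_query, toy_chunks, k=2):.4f}")
print(f"k=3: {context_relevance(toy_query, toy_chunks, k=3):.4f}")
```

The score drops as k grows because each extra chunk is, by construction, no more similar to the query than the ones already included.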

In [None]:
# Evaluate context relevance
pipeline = EvaluationPipeline(metrics=["context_relevance"])

relevance_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

relevance_score = relevance_result["context_relevance"].score
print(f"Context Relevance: {relevance_score:.4f}")
print(f"\nInterpretation:")
if relevance_score > 0.7:
    print("  ✓ Excellent - Highly relevant context retrieved")
elif relevance_score > 0.5:
    print("  ✓ Good - Relevant context retrieved")
elif relevance_score > 0.3:
    print("  ⚠ Fair - Some irrelevant context retrieved")
else:
    print("  ✗ Poor - Context not relevant to query")

### 3.2 Answer Faithfulness

**What it measures**: How self-contained and complete each chunk is.

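**How it works**: Measures semantic coherence within chunks (similar to semantic coherence but RAG-focused). Note that this differs from faithfulness as defined in RAGAS, which checks whether a generated answer is supported by the retrieved context.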

**Interpretation**:
- Higher scores = chunks can stand alone as answers
- Lower scores = chunks may lack context

**When to use**: When chunks need to be self-contained for LLM context.

In [None]:
# Evaluate answer faithfulness
pipeline = EvaluationPipeline(metrics=["answer_faithfulness"])

faithfulness_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
)

faithfulness_score = faithfulness_result["answer_faithfulness"].score
print(f"Answer Faithfulness: {faithfulness_score:.4f}")
print(f"\nInterpretation:")
if faithfulness_score > 0.7:
    print("  ✓ Excellent - Chunks are self-contained")
elif faithfulness_score > 0.5:
    print("  ✓ Good - Chunks have sufficient context")
else:
    print("  ⚠ Fair - Chunks may lack context")

### 3.3 Context Precision

**What it measures**: Precision of retrieved context for RAG.

**How it works**: Similar to Precision@k but RAG-focused.

**Interpretation**:
- Higher scores = less noise in retrieved context
- Lower scores = irrelevant chunks retrieved

**When to use**: When minimizing irrelevant context matters (e.g., cost reduction).

In [None]:
# Evaluate context precision
pipeline = EvaluationPipeline(metrics=["context_precision"])

context_precision_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

context_precision_score = context_precision_result["context_precision"].score
print(f"Context Precision: {context_precision_score:.4f}")
print(f"\nThis indicates {context_precision_score*100:.1f}% of retrieved context is relevant")

### 3.4 Context Recall

**What it measures**: Recall of retrieved context for RAG.

**How it works**: Similar to Recall@k but RAG-focused.

**Interpretation**:
- Higher scores = comprehensive context retrieved
- Lower scores = missing relevant information

**When to use**: When completeness of retrieved context matters.

In [None]:
# Evaluate context recall
pipeline = EvaluationPipeline(metrics=["context_recall"])

context_recall_result = await pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

context_recall_score = context_recall_result["context_recall"].score
print(f"Context Recall: {context_recall_score:.4f}")
print(f"\nThis indicates {context_recall_score*100:.1f}% of relevant context was retrieved")

## All Metrics Summary

Let's evaluate all metrics together.

In [None]:
# Evaluate all metrics at once
all_metrics_pipeline = EvaluationPipeline(
    metrics=[
        # Semantic
        "semantic_coherence",
        "boundary_quality",
        "chunk_stickiness",
        "topic_diversity",
        # Retrieval
        "ndcg_at_k",
        "recall_at_k",
        "precision_at_k",
        "mrr",
        # RAG Quality
        "context_relevance",
        "answer_faithfulness",
        "context_precision",
        "context_recall",
    ]
)

all_results = await all_metrics_pipeline.evaluate(
    chunks=chunks,
    embeddings=embeddings,
    ground_truth=ground_truth,
)

print("\n" + "="*70)
print("ALL METRICS SUMMARY")
print("="*70)

print("\nSEMANTIC METRICS (no ground truth needed)")
print("-" * 70)
print(f"Semantic Coherence:      {all_results['semantic_coherence'].score:.4f} (higher is better)")
print(f"Boundary Quality:        {all_results['boundary_quality'].score:.4f} (higher is better)")
print(f"Chunk Stickiness:        {all_results['chunk_stickiness'].score:.4f} (LOWER is better)")
print(f"Topic Diversity:         {all_results['topic_diversity'].score:.4f} (higher is better)")

print("\nRETRIEVAL METRICS (require ground truth)")
print("-" * 70)
print(f"NDCG@k:                  {all_results['ndcg_at_k'].score:.4f} (higher is better)")
print(f"Recall@k:                {all_results['recall_at_k'].score:.4f} (higher is better)")
print(f"Precision@k:             {all_results['precision_at_k'].score:.4f} (higher is better)")
print(f"MRR:                     {all_results['mrr'].score:.4f} (higher is better)")

print("\nRAG QUALITY METRICS (RAGAS-inspired)")
print("-" * 70)
print(f"Context Relevance:       {all_results['context_relevance'].score:.4f} (higher is better)")
print(f"Answer Faithfulness:     {all_results['answer_faithfulness'].score:.4f} (higher is better)")
print(f"Context Precision:       {all_results['context_precision'].score:.4f} (higher is better)")
print(f"Context Recall:          {all_results['context_recall'].score:.4f} (higher is better)")
print("="*70)

## Metric Selection Guide

### Use Semantic Metrics When:
- You don't have ground truth query data
- You want to evaluate chunking quality independently
- You're comparing strategies on a general document corpus
- **Recommended for initial evaluation**

### Use Retrieval Metrics When:
- You have queries and known relevant chunks
- You're building a search/retrieval system
- You want to optimize for ranking quality
- **Requires ground truth data**

### Use RAG Quality Metrics When:
- You're building a RAG system with LLMs
- You want to evaluate context quality for generation
- You care about self-containment of chunks
- **Best for RAG applications**

### Recommended Combinations:

**General Purpose (no ground truth):**
- semantic_coherence
- boundary_quality
- chunk_stickiness

**Search/Retrieval (with ground truth):**
- ndcg_at_k
- recall_at_k
- precision_at_k

**RAG Systems (with queries):**
- context_relevance
- answer_faithfulness
- context_precision
- context_recall
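
One way to keep these combinations reusable is a small lookup table, sketched below. The dictionary `RECOMMENDED_METRICS`, its keys, and the helper `metrics_for` are hypothetical conveniences; only the metric names come from this notebook.

```python
# Hypothetical lookup table built from the recommended combinations above.
RECOMMENDED_METRICS = {
    "general": ["semantic_coherence", "boundary_quality", "chunk_stickiness"],
    "retrieval": ["ndcg_at_k", "recall_at_k", "precision_at_k"],
    "rag": [
        "context_relevance",
        "answer_faithfulness",
        "context_precision",
        "context_recall",
    ],
}

def metrics_for(*use_cases):
    """Merge the metric lists for the given use cases, preserving order."""
    selected = []
    for use_case in use_cases:
        for name in RECOMMENDED_METRICS[use_case]:
            if name not in selected:
                selected.append(name)
    return selected

print(metrics_for("general", "retrieval"))
```

The returned list can be passed as the `metrics` argument of `EvaluationPipeline`, in the same way as the lists used earlier in this notebook.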

## Summary

In this notebook, you learned:

- ✅ All 12 evaluation metrics in ChunkFlow
- ✅ What each metric measures and how to interpret scores
- ✅ When to use each metric category
- ✅ How to create ground truth data for retrieval metrics
- ✅ How to evaluate chunks with multiple metrics
- ✅ Best practices for metric selection

## Next Steps

- **Notebook 04**: Visualization and advanced analysis
- **Notebook 05**: Using the ChunkFlow REST API