# ChunkFlow: Strategy Comparison

This notebook demonstrates how to compare multiple chunking strategies to find the best approach for your use case.

## What You'll Learn

1. How to set up multiple strategies for comparison
2. How to evaluate strategies with multiple metrics
3. How to interpret comparison results
4. How to make data-driven decisions about chunking strategies

## Prerequisites

```bash
pip install "chunk-flow[huggingface,viz]"
```

In [None]:
# Import required libraries
import asyncio
import numpy as np
from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline, StrategyComparator
from chunk_flow.analysis import ResultsDataFrame

print("✓ All imports successful!")

## 1. Sample Document

We'll use a longer document about AI to better demonstrate strategy differences.

In [None]:
document = """
# Artificial Intelligence: A Comprehensive Overview

Artificial Intelligence (AI) represents one of the most transformative technologies of the 21st century. 
It encompasses the development of computer systems capable of performing tasks that typically require 
human intelligence, such as visual perception, speech recognition, decision-making, and language translation.

## Historical Context

The concept of AI dates back to ancient mythology, but modern AI research began in the 1950s. Alan Turing's 
groundbreaking paper "Computing Machinery and Intelligence" posed the famous question: "Can machines think?" 
This led to the Turing Test, a criterion for determining machine intelligence.

The term "Artificial Intelligence" was coined by John McCarthy in 1956 at the Dartmouth Conference, which 
is considered the birth of AI as a field. Early optimism led to predictions that human-level AI would be 
achieved within a generation. However, the field experienced several "AI winters" - periods of reduced 
funding and interest due to unmet expectations.

## Machine Learning Fundamentals

Machine learning, a subset of AI, has become the dominant approach in modern AI systems. Instead of 
explicitly programming rules, machine learning algorithms learn patterns from data.

### Supervised Learning

Supervised learning algorithms learn from labeled training data. The algorithm receives input-output 
pairs and learns to map inputs to correct outputs. Common applications include:

- Image classification (identifying objects in photos)
- Spam detection in email
- Credit scoring and fraud detection
- Medical diagnosis from symptoms

Popular supervised learning algorithms include linear regression, logistic regression, decision trees, 
random forests, support vector machines, and neural networks.

### Unsupervised Learning

Unsupervised learning finds patterns in unlabeled data without predefined categories. The algorithm 
discovers hidden structures on its own. Key techniques include:

- Clustering: Grouping similar data points (customer segmentation, document organization)
- Dimensionality reduction: Simplifying data while preserving important information (PCA, t-SNE)
- Anomaly detection: Identifying unusual patterns (fraud detection, system monitoring)
- Association rule learning: Discovering relationships between variables (market basket analysis)

### Reinforcement Learning

Reinforcement learning trains agents to make sequences of decisions by rewarding desired behaviors and 
penalizing undesired ones. This approach has achieved remarkable success in:

- Game playing (AlphaGo defeating world champions at Go)
- Robotics (teaching robots to walk, grasp objects)
- Autonomous vehicles (navigation and decision-making)
- Resource optimization (data center cooling, traffic light control)

## Deep Learning Revolution

Deep learning, based on artificial neural networks with multiple layers, has driven recent AI breakthroughs. 
The "deep" in deep learning refers to the number of layers in the network.

### Neural Network Architecture

Deep neural networks consist of:
- Input layer: Receives raw data
- Hidden layers: Extract increasingly abstract features
- Output layer: Produces predictions or classifications

Key innovations enabling deep learning success:
- GPU acceleration for parallel computation
- Large labeled datasets (ImageNet, Common Crawl)
- Improved training algorithms (backpropagation, Adam optimizer)
- Regularization techniques (dropout, batch normalization)

### Convolutional Neural Networks (CNNs)

CNNs revolutionized computer vision by automatically learning spatial hierarchies of features. Applications:
- Image classification and object detection
- Facial recognition systems
- Medical image analysis (detecting tumors, diagnosing diseases)
- Autonomous vehicle perception

### Recurrent Neural Networks (RNNs)

RNNs and their variants (LSTM, GRU) excel at sequential data processing:
- Natural language processing and machine translation
- Speech recognition and synthesis
- Time series prediction (stock prices, weather)
- Video analysis and action recognition

### Transformer Architecture

Transformers, introduced in 2017, use attention mechanisms to process sequential data more effectively. 
They've become the foundation for modern language models like GPT, BERT, and T5. Benefits include:
- Parallel processing (faster training)
- Better handling of long-range dependencies
- Transfer learning capabilities
- Superior performance on NLP tasks

## Natural Language Processing

NLP enables computers to understand, interpret, and generate human language. Recent advances:

- Large language models achieving human-level performance on many tasks
- Context-aware translation (Google Translate, DeepL)
- Sentiment analysis for social media monitoring
- Question answering systems
- Text generation and summarization
- Conversational AI and chatbots

## Computer Vision

Computer vision enables machines to interpret visual information:
- Object detection and tracking
- Semantic segmentation (pixel-level classification)
- Pose estimation (understanding human body positions)
- 3D reconstruction from 2D images
- Visual question answering

## Ethical Considerations

As AI becomes more powerful, ethical concerns grow:

- Bias and fairness: AI systems can perpetuate or amplify existing biases
- Privacy: Data collection and surveillance concerns
- Transparency: "Black box" models are difficult to interpret
- Accountability: Who is responsible when AI makes mistakes?
- Job displacement: Automation may eliminate certain jobs
- Autonomous weapons: Military AI applications raise moral questions

## Future Directions

The future of AI holds exciting possibilities:

- Artificial General Intelligence (AGI): Systems with human-level intelligence across all domains
- Explainable AI: Making AI decisions transparent and interpretable
- Edge AI: Running AI on devices rather than in the cloud
- Quantum machine learning: Leveraging quantum computing for AI
- Neuromorphic computing: Brain-inspired hardware architectures
- Human-AI collaboration: Augmenting rather than replacing human capabilities

## Conclusion

AI continues to evolve rapidly, transforming industries and daily life. While challenges remain, 
responsible development and deployment of AI technologies promise to address some of humanity's 
most pressing problems while creating new opportunities for innovation and progress.
"""

print(f"Document length: {len(document)} characters")
print(f"Approximate words: {len(document.split())} words")

## 2. Create Strategies to Compare

We'll compare five configurations that cover four strategy types: fixed-size (at two chunk sizes), recursive, Markdown-aware, and semantic.

In [None]:
# Create strategies with different configurations
strategies = {
    "fixed_small": StrategyRegistry.create(
        "fixed_size",
        {"chunk_size": 300, "overlap": 50}
    ),
    "fixed_large": StrategyRegistry.create(
        "fixed_size",
        {"chunk_size": 600, "overlap": 100}
    ),
    "recursive_default": StrategyRegistry.create(
        "recursive",
        {"chunk_size": 500, "overlap": 80, "separators": ["\n\n", "\n", ". ", " "]}
    ),
    "markdown_aware": StrategyRegistry.create(
        "markdown",
        {"respect_headers": True, "max_chunk_size": 800}
    ),
    "semantic": StrategyRegistry.create(
        "semantic",
        {"threshold_percentile": 75, "min_chunk_size": 200}
    ),
}

print(f"Created {len(strategies)} strategies for comparison:")
for name in strategies.keys():
    print(f"  - {name}")
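
To make the `chunk_size` and `overlap` settings concrete, here is a minimal sketch of fixed-size chunking in plain Python. It assumes both values are measured in characters, which this notebook does not state, and `fixed_size_chunks` is a hypothetical helper written for illustration, not ChunkFlow's implementation.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz" * 40  # 1040 characters of filler text
small = fixed_size_chunks(sample, chunk_size=300, overlap=50)
large = fixed_size_chunks(sample, chunk_size=600, overlap=100)

print(f"small windows: {len(small)}, large windows: {len(large)}")
print(f"shared tail/head: {small[0][-50:] == small[1][:50]}")
```

Smaller windows produce more chunks, and the overlap repeats the tail of one chunk at the head of the next so that a sentence cut by a boundary still appears whole in at least one chunk.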

## 3. Chunk Document with Each Strategy

Let's see how each strategy chunks the document.

In [None]:
# Chunk document with each strategy
chunk_results = {}

for name, strategy in strategies.items():
    result = await strategy.chunk(document, doc_id="ai_overview")
    chunk_results[name] = result
    print(f"\n{name}:")
    print(f"  Chunks created: {len(result.chunks)}")
    print(f"  Processing time: {result.processing_time_ms:.2f}ms")
    print(f"  Avg chunk size: {np.mean([len(c) for c in result.chunks]):.1f} chars")
    print(f"  Size range: {min(len(c) for c in result.chunks)} - {max(len(c) for c in result.chunks)} chars")
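
The cell above relies on top-level `await`, which works in Jupyter but not in a plain Python script. The sketch below shows one way to run the same kind of loop from a script with `asyncio.run`, using `asyncio.gather` to schedule all strategies together. `fake_chunk` is a hypothetical stand-in for `strategy.chunk`, so nothing here depends on ChunkFlow itself.

```python
import asyncio


async def fake_chunk(name: str, text: str, size: int) -> tuple[str, list[str]]:
    """Hypothetical stand-in for an async strategy.chunk() call."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return name, [text[i:i + size] for i in range(0, len(text), size)]


async def chunk_all(text: str, sizes: dict[str, int]) -> dict[str, list[str]]:
    # Schedule every strategy at once and wait for all of them to finish
    pairs = await asyncio.gather(
        *(fake_chunk(name, text, size) for name, size in sizes.items())
    )
    return dict(pairs)


# asyncio.run starts an event loop; use it in scripts, not inside a notebook cell
results = asyncio.run(chunk_all("x" * 1000, {"small": 300, "large": 600}))
for name, chunks in results.items():
    print(f"{name}: {len(chunks)} chunks")
```

Inside a notebook, keep using `await` directly, because an event loop is already running there and `asyncio.run` would raise an error.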

## 4. Generate Embeddings for All Chunks

The semantic metrics in the next step operate on chunk embeddings, so we embed every chunk produced by every strategy.

In [None]:
# Create embedding provider
embedder = EmbeddingProviderFactory.create(
    "huggingface",
    {
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "device": "cpu",
        "normalize": True,
    }
)

# Generate embeddings for each strategy's chunks
embedding_results = {}

for name, chunk_result in chunk_results.items():
    emb_result = await embedder.embed_texts(chunk_result.chunks)
    embedding_results[name] = emb_result
    print(f"{name}: Generated {len(emb_result.embeddings)} embeddings in {emb_result.processing_time_ms:.2f}ms")
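
The `"normalize": True` option matters for similarity-based metrics. Assuming it L2-normalizes each vector (the usual meaning, though not confirmed here), the dot product of two embeddings equals their cosine similarity. The sketch below checks this with plain Python lists; the helper names are illustrative only.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


a = [3.0, 4.0]
b = [4.0, 3.0]

cos = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

print(f"cosine similarity:       {cos:.4f}")
print(f"dot of normalized pairs: {dot_of_normalized:.4f}")
```

Both values agree, which is why normalizing once at embedding time lets later steps use a cheap dot product.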

## 5. Evaluate Each Strategy

Now let's evaluate all strategies using multiple metrics.

In [None]:
# Create evaluation pipeline with semantic metrics (no ground truth needed)
pipeline = EvaluationPipeline(
    metrics=[
        "semantic_coherence",
        "boundary_quality",
        "chunk_stickiness",
        "topic_diversity"
    ]
)

# Evaluate each strategy
evaluation_results = {}

for name in strategies.keys():
    eval_result = await pipeline.evaluate(
        chunks=chunk_results[name].chunks,
        embeddings=embedding_results[name].embeddings,
    )
    evaluation_results[name] = eval_result

print("✓ Evaluation complete for all strategies")

## 6. Compare Results

Let's organize and compare the results.

In [None]:
# Create comparison table
import pandas as pd

comparison_data = []

for name in strategies.keys():
    row = {
        "Strategy": name,
        "Chunks": len(chunk_results[name].chunks),
        "Avg Size": int(np.mean([len(c) for c in chunk_results[name].chunks])),
    }
    
    # Add metric scores
    for metric_name, metric_result in evaluation_results[name].items():
        row[metric_name] = metric_result.score
    
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
print("\nStrategy Comparison:\n")
print(comparison_df.to_string(index=False))

## 7. Rank Strategies

Let's rank strategies by their performance.

In [None]:
# Use StrategyComparator for advanced analysis
report = StrategyComparator.generate_comparison_report(evaluation_results)

print("\n" + "="*80)
print("STRATEGY COMPARISON REPORT")
print("="*80)
print(report)

## 8. Detailed Metric Analysis

Let's analyze each metric individually.

In [None]:
# Analyze each metric
metrics = ["semantic_coherence", "boundary_quality", "chunk_stickiness", "topic_diversity"]

for metric in metrics:
    print(f"\n{'='*60}")
    print(f"{metric.upper().replace('_', ' ')}")
    print(f"{'='*60}")
    
    # Get scores for this metric
    scores = [(name, evaluation_results[name][metric].score) for name in strategies.keys()]
    
    # Sort by score (handle stickiness which is inverse)
    if metric == "chunk_stickiness":
        scores.sort(key=lambda x: x[1])  # Lower is better
        print("(Lower is better - less topic bleeding across boundaries)\n")
    else:
        scores.sort(key=lambda x: x[1], reverse=True)  # Higher is better
        print("(Higher is better)\n")
    
    # Print ranked results
    for rank, (name, score) in enumerate(scores, 1):
        # Clamp to [0, 1] so an out-of-range score cannot distort the bar
        bar = "█" * int(max(0.0, min(score, 1.0)) * 50)
        print(f"{rank}. {name:20s} {score:.4f} {bar}")

## 9. ResultsDataFrame Analysis

`ResultsDataFrame` wraps the evaluation results and lets you rank strategies with per-metric weights.

In [None]:
# Create ResultsDataFrame
results_df = ResultsDataFrame.from_evaluation_results(
    evaluation_results,
    strategy_names=list(strategies.keys())
)

# Rank strategies by weighted score
# Give higher weight to coherence and boundary quality
ranked = results_df.rank_strategies(
    weights={
        "semantic_coherence": 2.0,
        "boundary_quality": 2.0,
        "chunk_stickiness": 1.0,  # Inverted internally
        "topic_diversity": 1.0,
    },
    ascending=False
)

print("\nWeighted Ranking (Coherence & Boundary Quality weighted 2x):\n")
print(ranked[["strategy", "weighted_score", "semantic_coherence", "boundary_quality"]].to_string(index=False))
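
For intuition about what a weighted ranking computes, here is a minimal sketch in plain Python. It assumes every score lies in [0, 1] and that a lower-is-better metric is inverted as `1 - score` before weighting; how `rank_strategies` actually inverts and combines scores is not shown in this notebook. The strategy names and scores are made-up illustration values, not results.

```python
def weighted_score(
    scores: dict[str, float],
    weights: dict[str, float],
    lower_is_better: frozenset[str] = frozenset(),
) -> float:
    """Weighted mean of metric scores, flipping metrics where lower is better."""
    total = 0.0
    for metric, weight in weights.items():
        value = scores[metric]
        if metric in lower_is_better:
            value = 1.0 - value  # assumed inversion for scores in [0, 1]
        total += weight * value
    return total / sum(weights.values())


weights = {
    "semantic_coherence": 2.0,
    "boundary_quality": 2.0,
    "chunk_stickiness": 1.0,
    "topic_diversity": 1.0,
}

# Hypothetical scores, chosen only to illustrate the arithmetic
toy_scores = {
    "strategy_a": {"semantic_coherence": 0.8, "boundary_quality": 0.6,
                   "chunk_stickiness": 0.2, "topic_diversity": 0.5},
    "strategy_b": {"semantic_coherence": 0.7, "boundary_quality": 0.7,
                   "chunk_stickiness": 0.5, "topic_diversity": 0.6},
}

ranking = sorted(
    ((name, weighted_score(s, weights, frozenset({"chunk_stickiness"})))
     for name, s in toy_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Changing the weights is the main lever here: raising the weight on a metric that matters for your application can reorder strategies that are otherwise close.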

## 10. Export Results

Save the comparison results for future reference.

In [None]:
# Export to CSV
output_file = "strategy_comparison_results.csv"
comparison_df.to_csv(output_file, index=False)
print(f"\n✓ Results exported to {output_file}")

# Also export detailed results
results_df.to_csv("detailed_results.csv")
print("✓ Detailed results exported to detailed_results.csv")

## 11. Insights and Recommendations

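Exact scores depend on the document, the embedding model, and each strategy's configuration. The tendencies below are general expectations to check against your own results: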

### Key Findings

1. **Semantic Coherence**: Measures how semantically similar the content within each chunk is
   - Higher scores indicate chunks contain related concepts
   - Markdown and semantic strategies typically score higher

2. **Boundary Quality**: Measures how well chunks separate distinct topics
   - Higher scores indicate cleaner topic boundaries
   - Recursive and markdown strategies respect natural boundaries better

3. **Chunk Stickiness** (lower is better): Measures topic bleeding across boundaries
   - Lower scores indicate less topical overlap between adjacent chunks
   - Fixed-size strategies may have higher stickiness

4. **Topic Diversity**: Measures variety of topics across all chunks
   - Higher scores indicate chunks cover different topics
   - Depends on document structure and strategy
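
To build intuition for the boundary-related metrics, the sketch below computes two simple proxies from chunk embeddings: the mean cosine similarity between adjacent chunks (a rough stand-in for topic bleeding) and one minus the mean pairwise similarity (a rough stand-in for diversity). These are assumptions made for illustration and are not ChunkFlow's metric definitions; the toy vectors are invented.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / denom if denom else 0.0


def adjacent_similarity(embeddings: list[list[float]]) -> float:
    """Mean similarity of each chunk to the next one (higher = more bleeding)."""
    pairs = list(zip(embeddings, embeddings[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def pairwise_diversity(embeddings: list[list[float]]) -> float:
    """One minus the mean similarity over all chunk pairs (higher = more varied)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: the first two chunks share a topic, the third differs
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

adjacent = adjacent_similarity(toy_embeddings)
diversity = pairwise_diversity(toy_embeddings)
print(f"adjacent similarity: {adjacent:.3f}")
print(f"pairwise diversity:  {diversity:.3f}")
```

A strategy that cuts in the middle of a topic tends to raise the adjacent similarity, which matches the "lower is better" reading of stickiness above.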

### Recommendations

- **For structured documents (with headers)**: Use the `markdown` or `recursive` strategy
- **For maximum speed**: Use `fixed_size` with a chunk size suited to your content
- **For the best semantic quality**: Use the `semantic` or `recursive` strategy
- **For a balanced approach**: Use `recursive` (recommended default)

### Next Steps

- Try different configurations for each strategy
- Test with your specific document types
- Consider retrieval metrics if you have query data
- Visualize results (see notebook 04)

## Summary

In this notebook, you learned:

- ✅ How to create and configure multiple chunking strategies
- ✅ How to chunk documents with different strategies
- ✅ How to evaluate strategies with semantic metrics
- ✅ How to compare and rank strategies
- ✅ How to interpret evaluation results
- ✅ How to make data-driven decisions about chunking

## Related Notebooks

- **Notebook 03**: Deep dive into all 12 evaluation metrics
- **Notebook 04**: Visualization and advanced analysis
- **Notebook 05**: Using the ChunkFlow REST API