# ChunkFlow: Getting Started

Welcome to ChunkFlow! This notebook walks through the basics of text chunking for retrieval-augmented generation (RAG) systems.

## What You'll Learn

1. How to chunk text with different strategies
2. How to generate embeddings
3. How to evaluate chunking quality

## Installation

```bash
pip install "chunk-flow[huggingface]"
```

In [None]:
# Import required libraries
import asyncio
from chunk_flow.chunking import StrategyRegistry
from chunk_flow.embeddings import EmbeddingProviderFactory
from chunk_flow.evaluation import EvaluationPipeline

print("✓ ChunkFlow imported successfully!")
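
The cells in this notebook use top-level `await`, which Jupyter and IPython support but a plain Python script does not. As a minimal sketch of how the same calls could be driven from a script, the example below wraps an awaitable in `asyncio.run`. The `fake_chunk` coroutine is a hypothetical stand-in and is not part of ChunkFlow.

```python
import asyncio


async def fake_chunk(text: str, size: int) -> list[str]:
    """Hypothetical stand-in for an async chunker call; not part of ChunkFlow."""
    await asyncio.sleep(0)  # yield control once, as real async I/O would
    return [text[i:i + size] for i in range(0, len(text), size)]


async def main() -> list[str]:
    # Inside a coroutine, `await` works the same way as in the notebook cells.
    return await fake_chunk("abcdefghij", size=4)


# In a plain script, asyncio.run starts an event loop and runs the coroutine.
chunks = asyncio.run(main())
print(chunks)
```

Do not call `asyncio.run` inside a notebook cell: Jupyter already runs an event loop, so the call raises a `RuntimeError` there.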

## 1. Sample Document

Let's start with a sample document about machine learning.

In [None]:
document = """
# Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables computers 
to learn from data without explicit programming. It has revolutionized many 
industries including healthcare, finance, and transportation.

## Types of Machine Learning

There are three main types of machine learning:

1. **Supervised Learning**: The algorithm learns from labeled training data. 
   Examples include classification and regression tasks.

2. **Unsupervised Learning**: The algorithm finds patterns in unlabeled data. 
   Common techniques include clustering and dimensionality reduction.

3. **Reinforcement Learning**: The algorithm learns through trial and error, 
   receiving rewards or penalties for its actions.

## Applications

Machine learning powers many modern applications:
- Recommendation systems (Netflix, Spotify)
- Fraud detection in banking
- Image recognition and computer vision
- Natural language processing and chatbots
- Autonomous vehicles
"""

print(f"Document length: {len(document)} characters")

## 2. Chunking Strategies

ChunkFlow provides multiple strategies. Let's explore them!

### 2.1 Fixed-Size Chunking

The simplest approach: split text into fixed-size chunks, optionally overlapping.

In [None]:
# Create fixed-size chunker
fixed_chunker = StrategyRegistry.create(
    "fixed_size",
    {"chunk_size": 200, "overlap": 50}
)

# Chunk the document
fixed_result = await fixed_chunker.chunk(document, doc_id="ml_intro")

print(f"Created {len(fixed_result.chunks)} chunks")
print(f"Processing time: {fixed_result.processing_time_ms:.2f}ms\n")

# Show first 3 chunks
for i, chunk in enumerate(fixed_result.chunks[:3], 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(f"{chunk[:100]}...\n")
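
To make the `chunk_size` and `overlap` settings concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It assumes sizes are counted in characters and is only an illustration of the idea, not ChunkFlow's implementation. The `fixed_size_chunks` function is a hypothetical helper.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 50  # 500 characters
pieces = fixed_size_chunks(sample, chunk_size=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.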

### 2.2 Recursive Character Chunking (Recommended)

Respects natural boundaries like paragraphs and sentences.

In [None]:
# Create recursive chunker
recursive_chunker = StrategyRegistry.create(
    "recursive",
    {
        "chunk_size": 300,
        "overlap": 50,
        "separators": ["\n\n", "\n", ". ", " "]
    }
)

# Chunk the document
recursive_result = await recursive_chunker.chunk(document, doc_id="ml_intro")

print(f"Created {len(recursive_result.chunks)} chunks")
print(f"Processing time: {recursive_result.processing_time_ms:.2f}ms\n")

# Show first 3 chunks
for i, chunk in enumerate(recursive_result.chunks[:3], 1):
    print(f"Chunk {i}:")
    print(f"{chunk[:150]}...\n")
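
The role of the `separators` list can be sketched in plain Python. The hypothetical `recursive_split` below splits on the coarsest separator first and falls back to finer ones only for pieces that are still too long. It is a simplified illustration under stated assumptions: it ignores overlap, it drops the separator wherever it cuts, and it is not ChunkFlow's actual algorithm.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first, recursing into finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep merging small pieces into one chunk
            continue
        if buffer.strip():
            chunks.append(buffer)
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
parts = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", ". ", " "])
for part in parts:
    print(repr(part))
```

In this example only the middle paragraph exceeds 40 characters, so it alone is split at the sentence boundary while the other paragraphs stay whole.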

### 2.3 Markdown-Aware Chunking

Preserves document structure by respecting headers.

In [None]:
# Create markdown chunker
markdown_chunker = StrategyRegistry.create(
    "markdown",
    {"respect_headers": True}
)

# Chunk the document
markdown_result = await markdown_chunker.chunk(document, doc_id="ml_intro")

print(f"Created {len(markdown_result.chunks)} chunks")
print(f"Processing time: {markdown_result.processing_time_ms:.2f}ms\n")

# Show all chunks (should respect sections)
for i, chunk in enumerate(markdown_result.chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk[:200] + ("..." if len(chunk) > 200 else ""))
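
As a rough picture of what respecting headers means, the sketch below starts a new chunk at every Markdown header line and keeps each header together with its body. The `split_on_headers` function is a hypothetical helper, not ChunkFlow's implementation. It does not skip fenced code blocks, so a `#` comment at the start of a line inside a fence would be treated as a header.

```python
import re


def split_on_headers(markdown: str) -> list[str]:
    """Start a new chunk at every ATX header line (#, ##, ...), keeping the header with its body."""
    header = re.compile(r"^#{1,6}\s")
    sections, current = [], []
    for line in markdown.splitlines():
        # Close the running section when a new header begins and there is content to keep.
        if header.match(line) and any(l.strip() for l in current):
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if any(l.strip() for l in current):
        sections.append("\n".join(current).strip())
    return sections


doc = "# Title\n\nIntro text.\n\n## Part A\n\nBody A.\n\n## Part B\n\n- item"
sections = split_on_headers(doc)
for section in sections:
    print(repr(section))
```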

## 3. Generate Embeddings

Now let's convert our chunks to embeddings using a Hugging Face model that runs locally at no cost.

In [None]:
# Create embedding provider
embedder = EmbeddingProviderFactory.create(
    "huggingface",
    {
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "device": "cpu",
        "normalize": True,
    }
)

# Generate embeddings for recursive chunks
embedding_result = await embedder.embed_texts(recursive_result.chunks)

print(f"Generated {len(embedding_result.embeddings)} embeddings")
print(f"Embedding dimensions: {embedding_result.dimensions}")
print(f"Processing time: {embedding_result.processing_time_ms:.2f}ms")
print(f"Total tokens: {embedding_result.token_count}")
print("Cost: Free (local)")
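
The `"normalize": True` option presumably scales each embedding to unit length (L2 normalization); that reading is an assumption, not something this notebook confirms. The NumPy sketch below shows why this is convenient: once vectors have unit length, their dot product equals their cosine similarity. The `l2_normalize` function is a hypothetical helper.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)


raw = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])
unit = l2_normalize(raw)

# Dot product of unit vectors versus cosine similarity of the raw vectors.
dot = unit[0] @ unit[1]
cosine = (raw[0] @ raw[1]) / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1]))
print(unit)
print(dot, cosine)
```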

## 4. Evaluate Chunk Quality

Use ChunkFlow's evaluation metrics to assess chunking quality.

In [None]:
# Create evaluation pipeline with semantic metrics
pipeline = EvaluationPipeline(
    metrics=[
        "semantic_coherence",
        "boundary_quality",
        "chunk_stickiness",
        "topic_diversity"
    ]
)

# Evaluate chunks
eval_results = await pipeline.evaluate(
    chunks=recursive_result.chunks,
    embeddings=embedding_result.embeddings,
)

print("Evaluation Results:\n")
for metric_name, metric_result in eval_results.items():
    print(f"{metric_name:<25} {metric_result.score:.4f}")

### Understanding the Metrics

- **Semantic Coherence** (higher is better): How semantically similar the content within each chunk is
- **Boundary Quality** (higher is better): How well chunks separate distinct topics
- **Chunk Stickiness** (lower is better): How much topics bleed across chunk boundaries
- **Topic Diversity** (higher is better): How diverse topics are across all chunks
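
To build intuition for the boundary-related and diversity-related metrics, the sketch below computes two simple embedding-based proxies with NumPy. These are illustrative stand-ins only: the function names are hypothetical, and the formulas are assumptions rather than ChunkFlow's definitions. Similarity between adjacent chunks loosely mirrors topics bleeding across boundaries, and one minus the mean pairwise similarity loosely mirrors diversity.

```python
import numpy as np


def _unit_rows(embeddings) -> np.ndarray:
    """Return the embeddings as a float array with each row scaled to unit length."""
    arr = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / np.where(norms == 0, 1.0, norms)


def adjacent_similarity(embeddings) -> float:
    """Mean cosine similarity between each chunk and the next one (needs at least 2 chunks)."""
    unit = _unit_rows(embeddings)
    return float(np.mean(np.sum(unit[:-1] * unit[1:], axis=1)))


def pairwise_diversity(embeddings) -> float:
    """One minus the mean cosine similarity over all distinct chunk pairs."""
    unit = _unit_rows(embeddings)
    sims = unit @ unit.T
    upper = sims[np.triu_indices(len(unit), k=1)]  # each pair once, diagonal excluded
    return float(1.0 - upper.mean())


toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(adjacent_similarity(toy))
print(pairwise_diversity(toy))
```

In the toy data the first two chunks are identical and the third is orthogonal to both, so the adjacent similarity averages one perfect match with one complete mismatch.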

## 5. Similarity Analysis

Let's compute similarity between chunks to understand their relationships.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Convert embeddings to numpy array
embeddings_array = np.array(embedding_result.embeddings)

# Compute similarity matrix
similarity_matrix = cosine_similarity(embeddings_array)

print("Similarity Matrix:")
print(similarity_matrix)

# Find most similar chunk pairs
print("\nMost Similar Chunk Pairs:")
for i in range(len(similarity_matrix)):
    for j in range(i+1, len(similarity_matrix)):
        similarity = similarity_matrix[i, j]
        if similarity > 0.8:  # High similarity threshold
            print(f"Chunk {i+1} ↔ Chunk {j+1}: {similarity:.3f}")
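
On a short document, no pair may exceed the 0.8 threshold, in which case the loop above prints nothing. As an alternative sketch, the hypothetical `top_similar_pairs` helper below ranks all distinct pairs and returns the `k` most similar ones, whatever their absolute scores.

```python
import numpy as np


def top_similar_pairs(similarity_matrix: np.ndarray, k: int = 3) -> list[tuple[int, int, float]]:
    """Return the k highest-scoring (i, j, similarity) pairs with i < j."""
    # Indices of the strict upper triangle: every pair once, no self-similarity.
    rows, cols = np.triu_indices(similarity_matrix.shape[0], k=1)
    scores = similarity_matrix[rows, cols]
    order = np.argsort(scores)[::-1][:k]  # positions of the largest scores first
    return [(int(rows[n]), int(cols[n]), float(scores[n])) for n in order]


matrix = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
top_pairs = top_similar_pairs(matrix, k=2)
for i, j, score in top_pairs:
    print(f"Chunk {i + 1} <-> Chunk {j + 1}: {score:.3f}")
```

The indices returned are zero-based, so the printout adds one to match the chunk numbering used elsewhere in this notebook.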

## 6. Query-Based Retrieval

Simulate how chunks would be retrieved for a query.

In [None]:
# Create a query
query = "What are the different types of machine learning?"

# Embed the query
query_emb_result = await embedder.embed_texts([query])
query_embedding = query_emb_result.embeddings[0]

# Compute similarities
query_emb_array = np.array(query_embedding).reshape(1, -1)
similarities = cosine_similarity(query_emb_array, embeddings_array)[0]

# Get top 3 most relevant chunks
top_k = 3
top_indices = np.argsort(similarities)[::-1][:top_k]

print(f"Query: '{query}'\n")
print(f"Top {top_k} Most Relevant Chunks:\n")

for rank, idx in enumerate(top_indices, 1):
    print(f"{rank}. Similarity: {similarities[idx]:.4f}")
    print(f"   {recursive_result.chunks[idx][:150]}...\n")

## 7. Summary

In this notebook, you learned:

✅ How to use different chunking strategies (Fixed, Recursive, Markdown)
✅ How to generate embeddings with Hugging Face
✅ How to evaluate chunk quality with four metrics
✅ How to analyze chunk similarities
✅ How to retrieve relevant chunks for a query

## Next Steps

- **Notebook 02**: Compare multiple strategies
- **Notebook 03**: Advanced metrics and evaluation
- **Notebook 04**: Visualization and analysis