# Chonkie Refineries - Complete Guide

This notebook demonstrates all Refinery types in Chonkie: **OverlapRefinery** and **EmbeddingsRefinery**.

## What are Refineries?

Refineries are post-processors that enhance chunks with additional information. Each Refinery adds different capabilities:

- **OverlapRefinery**: Adds overlapping context from adjacent chunks (prefix or suffix)
- **EmbeddingsRefinery**: Adds vector embeddings to chunks for semantic search

## Key Features:
- ✅ Enhance chunks after the chunking process
- ✅ Maintain contextual continuity with overlap
- ✅ Enable semantic search with embeddings
- ✅ Configurable context size and methods
- ✅ Works with any chunker output

## Visual Overview

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ff6b6b','primaryTextColor':'#fff','primaryBorderColor':'#c92a2a','lineColor':'#339af0','secondaryColor':'#51cf66','tertiaryColor':'#ffd43b','background':'#f8f9fa','mainBkg':'#e3fafc','secondBkg':'#fff3bf','tertiaryBkg':'#ffe3e3','textColor':'#212529','fontSize':'16px'}}}%%

graph TB
    Start([🔧 Refineries<br/>Chunk Enhancers]):::startClass
    
    Start --> RefineryType{Choose Refinery Type}:::decisionClass
    
    RefineryType -->|Add Context| OverlapRef["📊 OverlapRefinery<br/>Add adjacent context"]:::overlapClass
    RefineryType -->|Add Embeddings| EmbedRef["🧬 EmbeddingsRefinery<br/>Add vector embeddings"]:::embedClass
    
    OverlapRef --> OverlapConfig{Configuration}:::decisionClass
    EmbedRef --> EmbedConfig{Configuration}:::decisionClass
    
    OverlapConfig -->|Method| MethodChoice["method='suffix' or 'prefix'"]:::paramClass
    OverlapConfig -->|Context Size| SizeChoice["context_size=0.25 or int"]:::paramClass
    OverlapConfig -->|Mode| ModeChoice["mode='token' or 'recursive'"]:::paramClass
    OverlapConfig -->|Merge| MergeChoice["merge=True or False"]:::paramClass
    
    EmbedConfig -->|Model| ModelChoice["embedding_model=str or instance"]:::paramClass
    
    MethodChoice --> OverlapProcess["Process Chunks"]:::processClass
    SizeChoice --> OverlapProcess
    ModeChoice --> OverlapProcess
    MergeChoice --> OverlapProcess
    
    ModelChoice --> EmbedProcess["Process Chunks"]:::processClass
    
    OverlapProcess --> OverlapOutput["📦 Chunks with Context<br/>text + context_before/after"]:::outputClass
    EmbedProcess --> EmbedOutput["📦 Chunks with Embeddings<br/>text + embedding vector"]:::embedOutputClass
    
    OverlapOutput --> UseCases{Use Cases}:::decisionClass
    EmbedOutput --> UseCases
    
    UseCases -->|Context| QA["❓ Question Answering<br/>Summarization"]:::useClass
    UseCases -->|Search| Semantic["🔍 Semantic Search<br/>Similarity"]:::useClass
    UseCases -->|Storage| VectorDB["💾 Vector Database<br/>Retrieval"]:::useClass
    
    classDef startClass fill:#4c6ef5,stroke:#364fc7,stroke-width:3px,color:#fff
    classDef decisionClass fill:#7950f2,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef overlapClass fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    classDef embedClass fill:#20c997,stroke:#087f5b,stroke-width:2px,color:#fff
    classDef paramClass fill:#748ffc,stroke:#4c6ef5,stroke-width:2px,color:#fff
    classDef processClass fill:#ffd43b,stroke:#fab005,stroke-width:2px,color:#333
    classDef outputClass fill:#51cf66,stroke:#37b24d,stroke-width:2px,color:#fff
    classDef embedOutputClass fill:#69db7c,stroke:#40c057,stroke-width:2px,color:#fff
    classDef useClass fill:#ff922b,stroke:#e8590c,stroke-width:2px,color:#fff
```

## Setup - Create Test Content

First, we'll create test content to demonstrate each Refinery.

In [1]:
# Test strings for demonstrations
test_strings = {
    "short": "This is the first sentence. This is the second sentence, providing context. This is the third sentence, which needs context from the second.",
    
    "medium": """Machine learning has revolutionized technology. Deep learning models can recognize patterns. 
Neural networks are inspired by the human brain. They consist of interconnected layers of nodes. 
Training these models requires large datasets. The data is processed through multiple iterations. 
Eventually, the model learns to make accurate predictions.""",
    
    "long": """Artificial intelligence is transforming industries worldwide. From healthcare to finance, AI applications are becoming ubiquitous.

Machine learning algorithms can analyze vast amounts of data. They identify patterns that humans might miss. This capability has led to breakthroughs in various fields.

Natural language processing enables computers to understand human language. Chatbots and virtual assistants rely on NLP technology. They can answer questions and provide information.

Computer vision allows machines to interpret visual information. Self-driving cars use computer vision to navigate roads. Medical imaging benefits from AI-powered diagnosis tools.

The future of AI holds immense potential. Researchers continue to push boundaries and discover new applications."""
}

print("✅ Test content created:")
for name, content in test_strings.items():
    print(f"  📝 {name}: {len(content)} characters")

✅ Test content created:
  📝 short: 140 characters
  📝 medium: 349 characters
  📝 long: 779 characters


## Installation

Install Chonkie to use refineries:

In [2]:
# Install chonkie
# !pip install chonkie

from chonkie import OverlapRefinery, EmbeddingsRefinery, TokenChunker

print("✅ All Refineries imported successfully!")
print(f"  📊 OverlapRefinery: {OverlapRefinery}")
print(f"  🧬 EmbeddingsRefinery: {EmbeddingsRefinery}")
print(f"  ✂️ TokenChunker: {TokenChunker}")

✅ All Refineries imported successfully!
  📊 OverlapRefinery: <class 'chonkie.refinery.overlap.OverlapRefinery'>
  🧬 EmbeddingsRefinery: <class 'chonkie.refinery.embedding.EmbeddingsRefinery'>
  ✂️ TokenChunker: <class 'chonkie.chunker.token.TokenChunker'>


---

# Part 1: OverlapRefinery

## 1. OverlapRefinery - Basic Initialization

Initialize OverlapRefinery with different configurations.

In [3]:
# Option 1: Default initialization (character tokenizer, 25% context)
overlap_default = OverlapRefinery()
print("📊 OverlapRefinery Initialization Options:\n")
print(f"  1. Default: {overlap_default}")

# Option 2: With suffix method (adds context from NEXT chunk)
overlap_suffix = OverlapRefinery(
    tokenizer="character",
    context_size=0.5,  # 50% of chunk size
    method="suffix",
    merge=True
)
print(f"  2. Suffix Method (50% context): {overlap_suffix}")

# Option 3: With prefix method (adds context from PREVIOUS chunk)
overlap_prefix = OverlapRefinery(
    tokenizer="character",
    context_size=30,  # Absolute number of characters
    method="prefix",
    merge=False
)
print(f"  3. Prefix Method (30 chars): {overlap_prefix}")

print("\n✅ All initialization options work!")

📊 OverlapRefinery Initialization Options:

  1. Default: OverlapRefinery(tokenizer=<chonkie.tokenizer.AutoTokenizer object at 0x000002A3FFEA8830>, context_size=0.25, mode=token, method=suffix, merge=True, inplace=True)
  2. Suffix Method (50% context): OverlapRefinery(tokenizer=<chonkie.tokenizer.AutoTokenizer object at 0x000002A3CF3BDE50>, context_size=0.5, mode=token, method=suffix, merge=True, inplace=True)
  3. Prefix Method (30 chars): OverlapRefinery(tokenizer=<chonkie.tokenizer.AutoTokenizer object at 0x000002A380045FD0>, context_size=30, mode=token, method=prefix, merge=False, inplace=True)

✅ All initialization options work!


## 2. OverlapRefinery - Suffix Method

Add context from the NEXT chunk to the end of the current chunk.

In [4]:
from chonkie import TokenChunker

# Step 1: Chunk the text
chunker = TokenChunker(chunk_size=50)
chunks = chunker(test_strings["short"])

print(f"📄 Original Chunks ({len(chunks)} chunks):\n")
for i, chunk in enumerate(chunks, 1):
    print(f"  Chunk {i}: {chunk.text}")
    print(f"  Length: {len(chunk.text)} chars\n")

# Step 2: Add suffix overlap (context from next chunk)
overlap_refinery = OverlapRefinery(
    tokenizer="character",
    context_size=0.5,  # 50% of chunk size as overlap
    method="suffix",
    merge=True
)

refined_chunks = overlap_refinery(chunks)

print(f"\n📊 Refined Chunks with SUFFIX Context ({len(refined_chunks)} chunks):\n")
for i, chunk in enumerate(refined_chunks, 1):
    print(f"  Chunk {i}: {chunk.text}")
    print(f"  Length: {len(chunk.text)} chars")
    if hasattr(chunk, 'context_after'):
        print(f"  Context after: {chunk.context_after}")
    print()

📄 Original Chunks (3 chunks):

  Chunk 1: This is the first sentence. This is the second sen
  Length: 50 chars

  Chunk 2: tence, providing context. This is the third senten
  Length: 50 chars

  Chunk 3: ce, which needs context from the second.
  Length: 40 chars


📊 Refined Chunks with SUFFIX Context (3 chunks):

  Chunk 1: This is the first sentence. This is the second sentence, providing context.
  Length: 75 chars

  Chunk 2: tence, providing context. This is the third sentence, which needs context f
  Length: 75 chars

  Chunk 3: ce, which needs context from the second.
  Length: 40 chars



## 3. OverlapRefinery - Prefix Method

Add context from the PREVIOUS chunk to the beginning of the current chunk.

In [5]:
# Step 1: Chunk the text
chunker = TokenChunker(chunk_size=50)
chunks = chunker(test_strings["short"])

# Step 2: Add prefix overlap (context from previous chunk)
overlap_refinery = OverlapRefinery(
    tokenizer="character",
    context_size=0.3,  # 30% of chunk size as overlap
    method="prefix",
    merge=True
)

refined_chunks = overlap_refinery(chunks)

print(f"📊 Refined Chunks with PREFIX Context ({len(refined_chunks)} chunks):\n")
for i, chunk in enumerate(refined_chunks, 1):
    print(f"  Chunk {i}: {chunk.text}")
    print(f"  Length: {len(chunk.text)} chars")
    if hasattr(chunk, 'context_before'):
        print(f"  Context before: {chunk.context_before}")
    print()

📊 Refined Chunks with PREFIX Context (3 chunks):

  Chunk 1: This is the first sentence. This is the second sen
  Length: 50 chars

  Chunk 2:  the second sentence, providing context. This is the third senten
  Length: 65 chars

  Chunk 3: he third sentence, which needs context from the second.
  Length: 55 chars
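
To make the two methods concrete, the outputs in sections 2 and 3 can be reproduced with plain string slicing. The sketch below is not Chonkie's implementation: `add_overlap` is a hypothetical helper, and it assumes character-level tokens and that a float `context_size` is a fraction of the longest chunk.

```python
def add_overlap(texts, context_size, method="suffix"):
    """Return new chunk texts with overlap merged in (hypothetical helper, not Chonkie's API)."""
    if not texts:
        return []
    # Assumption: a float is a fraction of the longest chunk; an int is an absolute size.
    if isinstance(context_size, float):
        n = round(context_size * max(len(t) for t in texts))
    else:
        n = context_size
    if n <= 0:
        return list(texts)

    refined = []
    for i, text in enumerate(texts):
        if method == "suffix" and i + 1 < len(texts):
            # Append the first n characters of the NEXT chunk.
            text = text + texts[i + 1][:n]
        elif method == "prefix" and i > 0:
            # Prepend the last n characters of the PREVIOUS chunk.
            text = texts[i - 1][-n:] + text
        refined.append(text)
    return refined


sentence = (
    "This is the first sentence. This is the second sentence, providing context. "
    "This is the third sentence, which needs context from the second."
)
# Split into 50-character pieces, mirroring the chunk boundaries shown above.
pieces = [sentence[i:i + 50] for i in range(0, len(sentence), 50)]

suffix_chunks = add_overlap(pieces, 0.5, method="suffix")
prefix_chunks = add_overlap(pieces, 0.3, method="prefix")
print([len(t) for t in suffix_chunks])
print([len(t) for t in prefix_chunks])
```

Under these assumptions the lengths come out the same as in the refined chunks printed in sections 2 and 3: the first and last chunks are left untouched by `prefix` and `suffix` respectively, because they have no neighbour on that side.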



## 4. OverlapRefinery - Merge vs No Merge

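- Compare merged context (added to the text) with context kept in a separate field.
- If `merge=True`, the calculated context is prepended (for prefix) or appended (for suffix) directly to `chunk.text`. If `merge=False`, the context is stored in the `chunk.context` attribute without modifying `chunk.text`.
- Note: the refinery repr in section 1 shows `inplace=True`, so Option 1 below appears to modify `chunks` in place. That would explain why Option 2 still reports a 60-character text for Chunk 1: it is refining chunks that already had context merged in. Re-chunk (or copy the chunks) before Option 2 for an independent comparison.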

In [8]:
# Chunk the text
chunker = TokenChunker(chunk_size=40)
chunks = chunker(test_strings["medium"])

print(f"📄 Testing with {len(chunks)} chunks\n")

# Option 1: Merge=True (context merged into text)
print("📊 Option 1: MERGE = TRUE (context added to text)\n")
overlap_merged = OverlapRefinery(
    tokenizer="character",
    context_size=20,
    method="suffix",
    merge=True
)
refined_merged = overlap_merged(chunks[:2])  # Just first 2 chunks for demo

for i, chunk in enumerate(refined_merged, 1):
    print(f"  Chunk {i}:")
    print(f"  Text: {chunk.text[:100]}...")
    print(f"  Length: {len(chunk.text)} chars\n")
    print(f" Context: {chunk.context}..")

# Option 2: Merge=False (context in separate field)
print("\n📊 Option 2: MERGE = FALSE (context separate)\n")
overlap_separate = OverlapRefinery(
    tokenizer="character",
    context_size=20,
    method="suffix",
    merge=False
)
refined_separate = overlap_separate(chunks[:2])

for i, chunk in enumerate(refined_separate, 1):
    print(f"  Chunk {i}:")
    print(f"  Text: {chunk.text[:80]}...")
    print(f"  Length: {len(chunk.text)} chars")
    print(f" Context: {chunk.context}..")
    if hasattr(chunk, 'context_after'):
        print(f"  Context (separate): {chunk.context_after}")
    print()

📄 Testing with 9 chunks

📊 Option 1: MERGE = TRUE (context added to text)

  Chunk 1:
  Text: Machine learning has revolutionized technology. Deep learnin...
  Length: 60 chars

 Context: nology. Deep learnin..
  Chunk 2:
  Text: nology. Deep learning models can recogni...
  Length: 40 chars

 Context: None..

📊 Option 2: MERGE = FALSE (context separate)

  Chunk 1:
  Text: Machine learning has revolutionized technology. Deep learnin...
  Length: 60 chars
 Context: nology. Deep learnin..

  Chunk 2:
  Text: nology. Deep learning models can recogni...
  Length: 40 chars
 Context: None..
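
The difference between the two `merge` settings is easier to see when each run starts from untouched chunks. The following is a minimal sketch of the idea only: `SimpleChunk` and `suffix_overlap` are hypothetical stand-ins rather than Chonkie classes, and the function returns new objects instead of modifying its input.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class SimpleChunk:
    """Stand-in for a chunk; only the fields needed here (not Chonkie's Chunk class)."""
    text: str
    context: Optional[str] = None


def suffix_overlap(chunks: List[SimpleChunk], n: int, merge: bool) -> List[SimpleChunk]:
    """Return NEW chunks carrying the first n characters of the next chunk as context."""
    refined = []
    for i, chunk in enumerate(chunks):
        context = chunks[i + 1].text[:n] if i + 1 < len(chunks) else None
        # merge=True appends the context to the text; merge=False leaves the text alone.
        text = chunk.text + context if (merge and context) else chunk.text
        refined.append(replace(chunk, text=text, context=context))
    return refined


source = "Machine learning has revolutionized technology. Deep learning models can recogni"
originals = [SimpleChunk(source[:40]), SimpleChunk(source[40:80])]

merged = suffix_overlap(originals, 20, merge=True)
separate = suffix_overlap(originals, 20, merge=False)

print(len(merged[0].text), repr(merged[0].context))
print(len(separate[0].text), repr(separate[0].context))
```

Because the inputs are never mutated here, the `merge=False` result keeps its original 40-character text while still exposing the 20-character context separately.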



## 5. OverlapRefinery - Context Size Comparison

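Compare different context sizes (fraction vs absolute). All three refineries below run on the same `test_chunks`, and the refinery repr in section 1 shows `inplace=True`, so the reported averages appear to be cumulative (each run adds context on top of the previous one) rather than independent measurements.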

In [9]:
# Chunk the text
chunker = TokenChunker(chunk_size=60)
chunks = chunker(test_strings["medium"])
test_chunks = chunks[:3]  # Use first 3 chunks

print("📊 Context Size Comparison\n")
print(f"Testing with {len(test_chunks)} chunks\n")

# Test 1: 25% context (fraction)
print("1️⃣ Context Size = 0.25 (25% of chunk)")
refinery_25 = OverlapRefinery(context_size=0.25, method="suffix", merge=True)
refined_25 = refinery_25(test_chunks)
print(f"   Avg length: {sum(len(c.text) for c in refined_25) / len(refined_25):.0f} chars\n")

# Test 2: 50% context (fraction)
print("2️⃣ Context Size = 0.5 (50% of chunk)")
refinery_50 = OverlapRefinery(context_size=0.5, method="suffix", merge=True)
refined_50 = refinery_50(test_chunks)
print(f"   Avg length: {sum(len(c.text) for c in refined_50) / len(refined_50):.0f} chars\n")

# Test 3: 30 chars absolute
print("3️⃣ Context Size = 30 (absolute chars)")
refinery_abs = OverlapRefinery(context_size=30, method="suffix", merge=True)
refined_abs = refinery_abs(test_chunks)
print(f"   Avg length: {sum(len(c.text) for c in refined_abs) / len(refined_abs):.0f} chars\n")

print("✅ Different context sizes demonstrated!")

📊 Context Size Comparison

Testing with 3 chunks

1️⃣ Context Size = 0.25 (25% of chunk)
   Avg length: 70 chars

2️⃣ Context Size = 0.5 (50% of chunk)
   Avg length: 95 chars

3️⃣ Context Size = 30 (absolute chars)
   Avg length: 115 chars

✅ Different context sizes demonstrated!
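
The averages above (70, 95, 115) look odd at first, since an absolute size of 30 should not add more than a 50% fraction of a 60-character chunk. A back-of-the-envelope model of chunk lengths suggests they are consistent with each refinery running on the output of the previous one. This is a sketch under stated assumptions, not Chonkie's code: `suffix_lengths` is a hypothetical helper, tokens are characters, and a float size is taken relative to the longest chunk in the list.

```python
def suffix_lengths(lengths, context_size):
    """Model chunk lengths after one suffix-overlap pass (lengths only, no text)."""
    if isinstance(context_size, float):
        # Assumption: the fraction is relative to the longest chunk currently in the list.
        n = round(context_size * max(lengths))
    else:
        n = context_size
    # Every chunk except the last gains up to n characters from its successor.
    grown = [length + min(n, lengths[i + 1]) for i, length in enumerate(lengths[:-1])]
    return grown + [lengths[-1]]


def average(lengths):
    return sum(lengths) / len(lengths)


start = [60, 60, 60]  # three 60-character chunks, as in the cell above

# Independent runs: each refinery sees the original chunks.
independent = [average(suffix_lengths(start, size)) for size in (0.25, 0.5, 30)]

# Cumulative runs: each refinery sees the output of the previous one (in-place mutation).
cumulative, current = [], start
for size in (0.25, 0.5, 30):
    current = suffix_lengths(current, size)
    cumulative.append(average(current))

print([f"{a:.0f}" for a in independent])
print([f"{a:.0f}" for a in cumulative])
```

In this model the cumulative sequence rounds to 70, 95 and 115, matching the notebook output, whereas independent runs would give 70, 80 and 80. Re-chunking before each refinery should therefore yield a fairer comparison.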


## 6. OverlapRefinery - Use Case: Question Answering

Demonstrate how overlap helps maintain context for QA.

In [10]:
qa_text = """Python was created by Guido van Rossum in 1991. It emphasizes code readability and simplicity.
The language supports multiple programming paradigms. These include procedural, object-oriented, and functional programming.
Python's extensive standard library is one of its greatest strengths. It provides modules for various tasks.
Many companies use Python for web development, data science, and automation."""

# WITHOUT overlap
print("❌ WITHOUT Overlap Refinery:\n")
chunker = TokenChunker(chunk_size=50)
chunks_no_overlap = chunker(qa_text)

for i, chunk in enumerate(chunks_no_overlap, 1):
    print(f"  Chunk {i}: {chunk.text}")
    print(f"  → Context: Limited to chunk only\n")

# WITH overlap
print("\n✅ WITH Overlap Refinery:\n")
overlap_refinery = OverlapRefinery(
    context_size=0.4,
    method="prefix",
    merge=True
)
chunks_with_overlap = overlap_refinery(chunks_no_overlap)

for i, chunk in enumerate(chunks_with_overlap, 1):
    print(f"  Chunk {i}: {chunk.text}")
    print(f"  → Context: Includes previous chunk context\n")

print("💡 With overlap, each chunk has more context for answering questions!")

❌ WITHOUT Overlap Refinery:

  Chunk 1: Python was created by Guido van Rossum in 1991. It
  → Context: Limited to chunk only

  Chunk 2:  emphasizes code readability and simplicity.
The l
  → Context: Limited to chunk only

  Chunk 3: anguage supports multiple programming paradigms. T
  → Context: Limited to chunk only

  Chunk 4: hese include procedural, object-oriented, and func
  → Context: Limited to chunk only

  Chunk 5: tional programming.
Python's extensive standard li
  → Context: Limited to chunk only

  Chunk 6: brary is one of its greatest strengths. It provide
  → Context: Limited to chunk only

  Chunk 7: s modules for various tasks.
Many companies use Py
  → Context: Limited to chunk only

  Chunk 8: thon for web development, data science, and automa
  → Context: Limited to chunk only

  Chunk 9: tion.
  → Context: Limited to chunk only


✅ WITH Overlap Refinery:

  Chunk 1: Python was created by Guido van Rossum in 1991. It
  → Context: Includes

---

# Part 2: EmbeddingsRefinery

## 7. EmbeddingsRefinery - Basic Initialization

Initialize EmbeddingsRefinery with an embedding model.

In [11]:
# Initialize with model string identifier
em_refinery = EmbeddingsRefinery(
    embedding_model="minishlab/potion-base-32M"
)

print("🧬 EmbeddingsRefinery Initialization:\n")
print(f"  Model: minishlab/potion-base-32M")
print(f"  Refinery: {em_refinery}")
print("\n✅ EmbeddingsRefinery ready!")

🧬 EmbeddingsRefinery Initialization:

  Model: minishlab/potion-base-32M
  Refinery: EmbeddingsRefinery(embedding_model=Model2VecEmbeddings(model=minishlab/potion-base-32M))

✅ EmbeddingsRefinery ready!


## 8. EmbeddingsRefinery - Add Embeddings to Chunks

Add vector embeddings to chunks for semantic search.

In [12]:
# Step 1: Chunk the text
test_text = test_strings["medium"]
chunker = TokenChunker(chunk_size=60)
chunks = chunker(test_text)

print(f"📄 Original Chunks ({len(chunks)} chunks):\n")
for i, chunk in enumerate(chunks, 1):
    print(f"  Chunk {i}: {chunk.text[:60]}...")
    print(f"  Has embedding: {hasattr(chunk, 'embedding')}\n")

# Step 2: Add embeddings
em_refinery = EmbeddingsRefinery(
    embedding_model="minishlab/potion-base-32M"
)

chunks_with_embeddings = em_refinery(chunks)

print(f"\n🧬 Chunks with Embeddings ({len(chunks_with_embeddings)} chunks):\n")
for i, chunk in enumerate(chunks_with_embeddings, 1):
    print(f"  Chunk {i}: {chunk.text[:60]}...")
    if hasattr(chunk, 'embedding'):
        print(f"  ✅ Has embedding: shape {chunk.embedding.shape}, dtype {chunk.embedding.dtype}")
        print(f"  First 5 values: {chunk.embedding[:5]}")
    print()

📄 Original Chunks (6 chunks):

  Chunk 1: Machine learning has revolutionized technology. Deep learnin...
  Has embedding: True

  Chunk 2: g models can recognize patterns. 
Neural networks are inspir...
  Has embedding: True

  Chunk 3: ed by the human brain. They consist of interconnected layers...
  Has embedding: True

  Chunk 4:  of nodes. 
Training these models requires large datasets. T...
  Has embedding: True

  Chunk 5: he data is processed through multiple iterations. 
Eventuall...
  Has embedding: True

  Chunk 6: y, the model learns to make accurate predictions....
  Has embedding: True


🧬 Chunks with Embeddings (6 chunks):

  Chunk 1: Machine learning has revolutionized technology. Deep learnin...
  ✅ Has embedding: shape (512,), dtype float32
  First 5 values: [ 0.00299574  0.18857926 -0.17724113  0.06512574 -0.02891888]

  Chunk 2: g models can recognize patterns. 
Neural networks are inspir...
  ✅ Has embedding: shape (512,), dtype float32
  First 5 values: 
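
One detail in the output above: `Has embedding: True` is printed for the original chunks too, before any refinery has run. That suggests the `embedding` attribute exists on every chunk and is simply unset until refinement, so `hasattr` cannot tell the two states apart. A value check is more informative. The sketch below illustrates this with a hypothetical `SimpleChunk` stand-in whose `embedding` field defaults to `None`; that default is an assumption about the real chunk class, not something the notebook confirms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleChunk:
    """Stand-in chunk whose embedding field exists but starts out unset (an assumption)."""
    text: str
    embedding: Optional[List[float]] = None


def is_embedded(chunk) -> bool:
    """True only when the chunk carries an actual embedding value."""
    return getattr(chunk, "embedding", None) is not None


chunk = SimpleChunk("Machine learning has revolutionized technology.")
before = (hasattr(chunk, "embedding"), is_embedded(chunk))

chunk.embedding = [0.1, 0.2, 0.3]  # placeholder values, not real model output
after = (hasattr(chunk, "embedding"), is_embedded(chunk))

print(before, after)
```

The `is not None` comparison also works when the embedding is a NumPy array, where a plain truthiness test would raise an error.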

## 9. EmbeddingsRefinery - Embedding Properties

Explore the properties of generated embeddings.

In [13]:
import numpy as np

# Create chunks and add embeddings
test_text = "Machine learning is powerful. Deep learning is a subset. Neural networks drive AI."
chunker = TokenChunker(chunk_size=20)
chunks = chunker(test_text)

em_refinery = EmbeddingsRefinery(embedding_model="minishlab/potion-base-32M")
chunks_with_embeddings = em_refinery(chunks)

print("🧬 Embedding Analysis:\n")

for i, chunk in enumerate(chunks_with_embeddings, 1):
    if hasattr(chunk, 'embedding'):
        emb = chunk.embedding
        print(f"Chunk {i}: \"{chunk.text}\"")
        print(f"  Shape: {emb.shape}")
        print(f"  Dtype: {emb.dtype}")
        print(f"  Min value: {emb.min():.4f}")
        print(f"  Max value: {emb.max():.4f}")
        print(f"  Mean: {emb.mean():.4f}")
        print(f"  Std: {emb.std():.4f}")
        print()

🧬 Embedding Analysis:

Chunk 1: "Machine learning is "
  Shape: (512,)
  Dtype: float32
  Min value: -0.1618
  Max value: 0.1756
  Mean: -0.0005
  Std: 0.0442

Chunk 2: "powerful. Deep learn"
  Shape: (512,)
  Dtype: float32
  Min value: -0.1402
  Max value: 0.1917
  Mean: 0.0004
  Std: 0.0442

Chunk 3: "ing is a subset. Neu"
  Shape: (512,)
  Dtype: float32
  Min value: -0.1765
  Max value: 0.1641
  Mean: 0.0005
  Std: 0.0442

Chunk 4: "ral networks drive A"
  Shape: (512,)
  Dtype: float32
  Min value: -0.1694
  Max value: 0.1540
  Mean: 0.0010
  Std: 0.0442

Chunk 5: "I."
  Shape: (512,)
  Dtype: float32
  Min value: -0.1266
  Max value: 0.1258
  Mean: -0.0018
  Std: 0.0442



## 10. EmbeddingsRefinery - Semantic Similarity

Calculate similarity between chunk embeddings.

In [14]:
import numpy as np

def cosine_similarity(emb1, emb2):
    """Calculate cosine similarity between two embeddings."""
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

# Test different texts
texts = [
    "Python is a programming language",
    "Java is also a programming language",
    "I love eating pizza for dinner"
]

# Create chunks and add embeddings
chunker = TokenChunker(chunk_size=100)  # Large enough for full sentences
all_chunks = []
for text in texts:
    chunks = chunker(text)
    all_chunks.extend(chunks)

em_refinery = EmbeddingsRefinery(embedding_model="minishlab/potion-base-32M")
embedded_chunks = em_refinery(all_chunks)

print("🔍 Semantic Similarity Analysis:\n")

# Compare all pairs
for i in range(len(embedded_chunks)):
    for j in range(i+1, len(embedded_chunks)):
        if hasattr(embedded_chunks[i], 'embedding') and hasattr(embedded_chunks[j], 'embedding'):
            sim = cosine_similarity(embedded_chunks[i].embedding, embedded_chunks[j].embedding)
            print(f"Text 1: \"{embedded_chunks[i].text}\"")
            print(f"Text 2: \"{embedded_chunks[j].text}\"")
            print(f"Similarity: {sim:.4f}")
            print(f"Interpretation: {'🟢 Very Similar' if sim > 0.8 else '🟡 Somewhat Similar' if sim > 0.5 else '🔴 Not Similar'}\n")

🔍 Semantic Similarity Analysis:

Text 1: "Python is a programming language"
Text 2: "Java is also a programming language"
Similarity: 0.7655
Interpretation: 🟡 Somewhat Similar

Text 1: "Python is a programming language"
Text 2: "I love eating pizza for dinner"
Similarity: 0.2002
Interpretation: 🔴 Not Similar

Text 1: "Java is also a programming language"
Text 2: "I love eating pizza for dinner"
Similarity: 0.0951
Interpretation: 🔴 Not Similar



## 11. EmbeddingsRefinery - Use Case: Vector Search

Demonstrate semantic search using embeddings.

In [15]:
# Create a mini document corpus
documents = [
    "Python is excellent for data science and machine learning projects.",
    "JavaScript is the primary language for web development.",
    "Machine learning models require large amounts of training data.",
    "Web browsers execute JavaScript code on the client side.",
    "Neural networks are inspired by biological brain structures."
]

# Chunk and embed documents
chunker = TokenChunker(chunk_size=150)
all_chunks = []
for doc in documents:
    chunks = chunker(doc)
    all_chunks.extend(chunks)

em_refinery = EmbeddingsRefinery(embedding_model="minishlab/potion-base-32M")
embedded_chunks = em_refinery(all_chunks)

# Search query
query = "What programming language is best for AI?"
query_chunks = chunker(query)
query_embedded = em_refinery(query_chunks)
query_embedding = query_embedded[0].embedding

print(f"🔍 Semantic Search Results\n")
print(f"Query: \"{query}\"\n")

# Calculate similarities
results = []
for chunk in embedded_chunks:
    if hasattr(chunk, 'embedding'):
        sim = cosine_similarity(query_embedding, chunk.embedding)
        results.append((chunk.text, sim))

# Sort by similarity
results.sort(key=lambda x: x[1], reverse=True)

print("Top 3 Most Relevant Results:\n")
for i, (text, score) in enumerate(results[:3], 1):
    print(f"{i}. Score: {score:.4f}")
    print(f"   Text: {text}")
    print()

🔍 Semantic Search Results

Query: "What programming language is best for AI?"

Top 3 Most Relevant Results:

1. Score: 0.4881
   Text: JavaScript is the primary language for web development.

2. Score: 0.4169
   Text: Python is excellent for data science and machine learning projects.

3. Score: 0.3884
   Text: Web browsers execute JavaScript code on the client side.
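
The ranking step in this search is independent of the embedding model, so it can be tried out on made-up vectors. Below is a minimal, dependency-free sketch: `cosine` and `top_k` are hypothetical helpers, the three-dimensional vectors are invented for illustration, and the zero-norm guard is an addition that the notebook's `cosine_similarity` does not have.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = ((cosine(query_vec, vec), text) for text, vec in corpus)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy 3-dimensional vectors made up for illustration (not model output).
corpus = [
    ("doc about topic A", [1.0, 0.0, 0.0]),
    ("doc about topic B", [0.0, 1.0, 0.0]),
    ("doc mixing A and B", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], corpus, k=2)
for score, text in results:
    print(f"{score:.4f}  {text}")
```

`heapq.nlargest` avoids sorting the whole corpus when only a few results are needed, which matters more as the number of chunks grows.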



---

# Part 3: Combined Refineries

## 12. Using Both Refineries Together

Combine OverlapRefinery and EmbeddingsRefinery for maximum enhancement.

In [16]:
# Step 1: Chunk the text
test_text = test_strings["long"]
chunker = TokenChunker(chunk_size=80)
chunks = chunker(test_text)

print(f"📄 Starting with {len(chunks)} chunks\n")

# Step 2: Add overlap context
overlap_refinery = OverlapRefinery(
    context_size=0.3,
    method="suffix",
    merge=True
)
chunks_with_overlap = overlap_refinery(chunks)
print(f"✅ Step 1: Added overlap context")

# Step 3: Add embeddings
em_refinery = EmbeddingsRefinery(
    embedding_model="minishlab/potion-base-32M"
)
fully_refined_chunks = em_refinery(chunks_with_overlap)
print(f"✅ Step 2: Added embeddings\n")

# Display results
print(f"🎯 Fully Refined Chunks ({len(fully_refined_chunks)} chunks):\n")
for i, chunk in enumerate(fully_refined_chunks[:3], 1):  # Show first 3
    print(f"Chunk {i}:")
    print(f"  Text: {chunk.text[:70]}...")
    print(f"  Has overlap context: {len(chunk.text) > 80}")
    print(f"  Has embedding: {hasattr(chunk, 'embedding')}")
    if hasattr(chunk, 'embedding'):
        print(f"  Embedding shape: {chunk.embedding.shape}")
    print()

print("✨ Chunks now have both contextual overlap AND semantic embeddings!")

📄 Starting with 10 chunks

✅ Step 1: Added overlap context
✅ Step 2: Added embeddings

🎯 Fully Refined Chunks (10 chunks):

Chunk 1:
  Text: Artificial intelligence is transforming industries worldwide. From hea...
  Has overlap context: True
  Has embedding: True
  Embedding shape: (512,)

Chunk 2:
  Text:  finance, AI applications are becoming ubiquitous.

Machine learning a...
  Has overlap context: True
  Has embedding: True
  Embedding shape: (512,)

Chunk 3:
  Text: can analyze vast amounts of data. They identify patterns that humans m...
  Has overlap context: True
  Has embedding: True
  Embedding shape: (512,)

✨ Chunks now have both contextual overlap AND semantic embeddings!


## 13. Real-World Pipeline: Document Processing

Complete pipeline: Chunk → Add Context → Add Embeddings → Ready for Vector DB

In [17]:
def process_document_for_vectordb(text, chunk_size=100, context_size=0.25):
    """Process document with chunking, overlap, and embeddings."""
    # Step 1: Chunk
    chunker = TokenChunker(chunk_size=chunk_size)
    chunks = chunker(text)
    print(f"📝 Step 1: Created {len(chunks)} chunks")
    
    # Step 2: Add overlap
    overlap_refinery = OverlapRefinery(
        context_size=context_size,
        method="suffix",
        merge=True
    )
    chunks = overlap_refinery(chunks)
    print(f"📊 Step 2: Added overlap context")
    
    # Step 3: Add embeddings
    em_refinery = EmbeddingsRefinery(
        embedding_model="minishlab/potion-base-32M"
    )
    chunks = em_refinery(chunks)
    print(f"🧬 Step 3: Added embeddings")
    
    return chunks

# Process a document
document = test_strings["long"]
print("🚀 Processing Document for Vector Database\n")
print(f"Document length: {len(document)} characters\n")

processed_chunks = process_document_for_vectordb(document)

print(f"\n✅ Document processed! Ready for vector database insertion")
print(f"\n📦 Output Summary:")
print(f"  Total chunks: {len(processed_chunks)}")
print(f"  Each chunk has:")
print(f"    - Text content with overlap context")
print(f"    - Vector embedding for semantic search")
print(f"    - Metadata (start_index, end_index, token_count)")

# Show sample chunk structure
print(f"\n📋 Sample Chunk Structure:")
sample = processed_chunks[0]
print(f"  text: {sample.text[:60]}...")
if hasattr(sample, 'embedding'):
    print(f"  embedding: array of shape {sample.embedding.shape}")
print(f"  start_index: {sample.start_index}")
print(f"  end_index: {sample.end_index}")
print(f"  token_count: {sample.token_count}")

🚀 Processing Document for Vector Database

Document length: 779 characters

📝 Step 1: Created 8 chunks
📊 Step 2: Added overlap context
🧬 Step 3: Added embeddings

✅ Document processed! Ready for vector database insertion

📦 Output Summary:
  Total chunks: 8
  Each chunk has:
    - Text content with overlap context
    - Vector embedding for semantic search
    - Metadata (start_index, end_index, token_count)

📋 Sample Chunk Structure:
  text: Artificial intelligence is transforming industries worldwide...
  embedding: array of shape (512,)
  start_index: 0
  end_index: 100
  token_count: 125


---

## Summary: All Refinery Types and Capabilities

### Refinery Comparison Table

| Refinery | Purpose | Key Parameters | Output | Use Cases |
|----------|---------|----------------|--------|-----------|
| **OverlapRefinery** | Add context from adjacent chunks | `context_size`, `method`, `merge` | Chunks with overlap context | QA, Summarization, Context preservation |
| **EmbeddingsRefinery** | Add vector embeddings | `embedding_model` | Chunks with embeddings | Semantic search, Vector DB, Similarity |

### OverlapRefinery Parameters

- **tokenizer**: `"character"`, `"word"`, `"gpt2"`, or custom (default: `"character"`)
- **context_size**: `float` (0-1 as fraction) or `int` (absolute tokens) (default: `0.25`)
- **method**: `"suffix"` (next chunk) or `"prefix"` (previous chunk) (default: `"suffix"`)
- **mode**: `"token"` or `"recursive"` (default: `"token"`)
- **merge**: `True` (merge into text) or `False` (separate field) (default: `True`)

### EmbeddingsRefinery Parameters

- **embedding_model**: Model string identifier or `BaseEmbeddings` instance (required)

### Methods Available

All refineries support:
- `refine(chunks)` - Refine a list of chunks
- `__call__(chunks)` - Callable interface for refining
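
The two entry points above are equivalent ways of invoking the same logic. As a rough illustration of that pattern, here is a hypothetical `StripRefinery` that works on plain strings. It does not subclass or reproduce any Chonkie base class; it only shows `__call__` delegating to `refine`.

```python
from typing import List


class StripRefinery:
    """Hypothetical refinery-style class; it does not subclass anything from Chonkie."""

    def refine(self, chunks: List[str]) -> List[str]:
        # Example refinement: trim surrounding whitespace from each chunk text.
        return [chunk.strip() for chunk in chunks]

    def __call__(self, chunks: List[str]) -> List[str]:
        # Calling the instance simply delegates to refine().
        return self.refine(chunks)


refinery = StripRefinery()
chunks = ["  first chunk ", "second chunk\n"]
print(refinery(chunks) == refinery.refine(chunks))
```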

### Best Practices

✅ **OverlapRefinery**:
- Use `suffix` for forward-looking context (next chunk)
- Use `prefix` for backward-looking context (previous chunk)
- Set `context_size` to 0.25-0.5 for good balance
- Use `merge=True` for simpler data structure
- Useful for QA and summarization tasks

✅ **EmbeddingsRefinery**:
- Choose an embedding model appropriate for your domain
- Smaller models (such as the `32M` model used here) are generally faster; larger models tend to be more accurate
- Essential for semantic search and vector databases
- Calculate cosine similarity for relevance scoring

✅ **Combined Usage**:
- Apply OverlapRefinery first, then EmbeddingsRefinery, so the embeddings are computed on the text that includes the overlap
- Overlap provides context, embeddings enable search
- Well suited to RAG (Retrieval-Augmented Generation) systems
- Ready for vector database insertion (Pinecone, Weaviate, ChromaDB, etc.)

### Integration Pattern

```python
# Standard refinement pipeline
chunks = chunker(text)
chunks = overlap_refinery(chunks)     # Add context
chunks = embeddings_refinery(chunks)  # Add embeddings
# → Ready for vector database!
```
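
What "ready for vector database" means in practice depends on the store, but most accept an ID, a vector and some metadata per item. The sketch below shows one possible conversion. The record layout, the `to_records` helper and the `SimpleChunk` stand-in are all assumptions for illustration; only the attribute names (`text`, `embedding`, `start_index`, `end_index`, `token_count`) come from the sample chunk structure in section 13.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class SimpleChunk:
    """Stand-in with the fields shown in the sample chunk structure (not Chonkie's class)."""
    text: str
    embedding: Sequence[float]
    start_index: int
    end_index: int
    token_count: int


def to_records(chunks: List[Any], doc_id: str) -> List[Dict[str, Any]]:
    """Convert embedded chunks into plain dicts (the record layout is an assumption)."""
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"{doc_id}-{i}",
            "text": chunk.text,
            # Plain floats keep the record JSON-serializable, even for NumPy arrays.
            "embedding": [float(x) for x in chunk.embedding],
            "metadata": {
                "start_index": chunk.start_index,
                "end_index": chunk.end_index,
                "token_count": chunk.token_count,
            },
        })
    return records


demo = [SimpleChunk("Artificial intelligence is transforming", [0.1, -0.2], 0, 39, 39)]
records = to_records(demo, doc_id="doc1")
print(records[0]["id"], records[0]["metadata"])
```

Each store has its own insert call and field names, so treat the dictionaries as an intermediate format to adapt rather than a fixed schema.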