# üöÄ RAG Pipeline - End to End (Local Version)

This notebook implements a complete **Retrieval Augmented Generation (RAG)** pipeline:

1. **Load** a PDF file from local path
2. **Chunk** into semantic pieces
3. **Embed** using OpenAI embeddings
4. **Store** in Qdrant vector database
5. **Query** and retrieve relevant context

Run each cell sequentially to build your RAG system!

---
## Cell 1: Install Dependencies
Run this cell first to install all required packages.

In [12]:
# Install required packages
!pip install -q pdfplumber openai qdrant-client rank-bm25 numpy python-dotenv sentence-transformers

print("‚úÖ All dependencies installed!")


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Users/macbookpro/Documents/Agent-langchain-rag-mcp-tools-boilerplate/venv/bin/python3.11 -m pip install --upgrade pip[0m
‚úÖ All dependencies installed!


---
## Cell 2: Configuration
Set up your OpenAI API key and other settings.

In [13]:
import os
from dotenv import load_dotenv

# Load from .env file if it exists
load_dotenv()

# =============================================================================
# API KEY SETUP
# =============================================================================
# Option 1: Set directly here
# os.environ["OPENAI_API_KEY"] = "sk-your-key-here"

# Option 2: Use .env file (recommended) - already loaded above

# Check if API key is set
if not os.environ.get("OPENAI_API_KEY"):
    print("‚ö†Ô∏è  OPENAI_API_KEY not found!")
    print("   Set it in .env file or uncomment Option 1 above")
else:
    print("‚úÖ OpenAI API Key found!")

# =============================================================================
# CONFIGURATION
# =============================================================================
# Paths
QDRANT_PATH = "./qdrant_db"  # Local storage for Qdrant
COLLECTION_NAME = "rag_collection"

# Embedding settings (OpenAI text-embedding-3-large)
EMBED_MODEL = "text-embedding-3-large"
EMBED_DIMENSION = 3072

# Chunking settings
CHUNK_MIN_SIZE = 200   # Minimum characters per chunk
CHUNK_MAX_SIZE = 800   # Maximum characters per chunk

# Search settings
VECTOR_WEIGHT = 0.7    # Weight for vector (semantic) search
BM25_WEIGHT = 0.3      # Weight for BM25 (keyword) search
DEFAULT_TOP_K = 5      # Number of results to retrieve

print(f"\nüìã Configuration:")
print(f"   Embedding model: {EMBED_MODEL}")
print(f"   Chunk size: {CHUNK_MIN_SIZE}-{CHUNK_MAX_SIZE} chars")
print(f"   Qdrant path: {QDRANT_PATH}")

‚úÖ OpenAI API Key found!

üìã Configuration:
   Embedding model: text-embedding-3-large
   Chunk size: 200-800 chars
   Qdrant path: ./qdrant_db


---
## Cell 3: Specify PDF File Path
Enter the path to your PDF file.

In [14]:
import os

# =============================================================================
# üìÅ ENTER YOUR PDF FILE PATH HERE
# =============================================================================
PDF_FILE_PATH = "./documents/jd1.pdf"  

# Or specify multiple files as a list:
# PDF_FILE_PATHS = [
#     "./documents/file1.pdf",
#     "./documents/file2.pdf",
# ]

# Convert to list for unified processing
if isinstance(PDF_FILE_PATH, str):
    pdf_files = [PDF_FILE_PATH]
else:
    pdf_files = PDF_FILE_PATH

# Validate files exist
valid_files = []
for filepath in pdf_files:
    if os.path.exists(filepath):
        valid_files.append(filepath)
        print(f"‚úÖ Found: {filepath}")
    else:
        print(f"‚ùå Not found: {filepath}")

if not valid_files:
    print("\n‚ö†Ô∏è  No valid PDF files found! Please check the path above.")
else:
    print(f"\nüìÑ {len(valid_files)} file(s) ready to process")

‚úÖ Found: ./documents/jd1.pdf

üìÑ 1 file(s) ready to process


---
## Cell 4: Document Loader
Extract text from PDF files using pdfplumber.

In [15]:
import pdfplumber
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Document:
    """Represents a loaded document with metadata."""
    content: str
    metadata: Dict
    source: str
    doc_type: str

def load_pdf(filepath: str) -> Document:
    """Load a single PDF file."""
    text = ""
    page_count = 0
    
    try:
        with pdfplumber.open(filepath) as pdf:
            page_count = len(pdf.pages)
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n\n"
    except Exception as e:
        print(f"  ‚ö†Ô∏è  Error loading {filepath}: {e}")
        return None
    
    return Document(
        content=text.strip(),
        metadata={
            "page_count": page_count,
            "file_size": os.path.getsize(filepath),
            "filename": os.path.basename(filepath)
        },
        source=filepath,
        doc_type="pdf"
    )

# Load all PDF files
print("üìÑ Loading documents...")
documents = []

for filepath in valid_files:
    if filepath.lower().endswith('.pdf'):
        doc = load_pdf(filepath)
        if doc:
            documents.append(doc)
            # print("  ‚úì Loaded: ", doc.content)
            print(f"   ‚úì Loaded: {doc.metadata['filename']} ({len(doc.content)} chars, {doc.metadata['page_count']} pages)")

print(f"\n‚úÖ Loaded {len(documents)} document(s)!")

üìÑ Loading documents...
   ‚úì Loaded: jd1.pdf (29705 chars, 10 pages)

‚úÖ Loaded 1 document(s)!


---
## Cell 5: Semantic Chunker
Split documents into meaningful chunks for embedding.

In [16]:
# import re
# from dataclasses import dataclass
# from typing import List, Dict

# @dataclass
# class Chunk:
#     """Represents a text chunk with metadata."""
#     text: str
#     metadata: Dict
#     chunk_index: int

# def clean_text(text: str) -> str:
#     """Clean and normalize text."""
#     text = re.sub(r'\b\d+:\d+\b', '', text)  # Remove timestamps
#     text = re.sub(r'\n\s*\d+\s*\n', '\n', text)  # Remove page numbers
#     text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
#     return text.strip()

# def split_into_sentences(text: str) -> List[str]:
#     """Split text into sentences."""
#     text = re.sub(r'\s+', ' ', text)
#     sentences = re.split(r'(?<=[.!?])\s+', text)
#     return [s.strip() for s in sentences if s.strip()]

# def merge_small_chunks(chunks: List[str]) -> List[str]:
#     """Merge chunks that are too small."""
#     if not chunks:
#         return []
    
#     merged = []
#     current = chunks[0]
    
#     for chunk in chunks[1:]:
#         if len(current) < CHUNK_MIN_SIZE:
#             current += ' ' + chunk
#         else:
#             merged.append(current)
#             current = chunk
    
#     merged.append(current)
#     return merged

# def chunk_document(content: str, source: str) -> List[Chunk]:
#     """Chunk a document using adaptive strategy."""
#     # Remove timestamps
#     content = re.sub(r'\b\d+:\d+\b', '', content)
    
#     # Check if text has proper punctuation
#     has_punctuation = any(content.count(p) > 5 for p in ['.', '!', '?'])
    
#     if has_punctuation:
#         # Sentence-based chunking for well-formatted text
#         cleaned = clean_text(content)
#         sentences = split_into_sentences(cleaned)
        
#         raw_chunks = []
#         current_chunk = []
#         current_size = 0
        
#         for sentence in sentences:
#             sentence_len = len(sentence)
            
#             if current_size + sentence_len > CHUNK_MAX_SIZE and current_chunk:
#                 raw_chunks.append(' '.join(current_chunk))
#                 current_chunk = []
#                 current_size = 0
            
#             current_chunk.append(sentence)
#             current_size += sentence_len + 1
            
#             if current_size >= CHUNK_MAX_SIZE * 0.6 and sentence.endswith(('.', '!', '?')):
#                 raw_chunks.append(' '.join(current_chunk))
#                 current_chunk = []
#                 current_size = 0
        
#         if current_chunk:
#             raw_chunks.append(' '.join(current_chunk))

#         print(raw_chunks)
#     else:
#         # Line-based chunking for transcripts
#         lines = content.split('\n')
#         lines = [l.strip() for l in lines if l.strip()]
        
#         raw_chunks = []
#         current_chunk = []
#         current_size = 0
        
#         for line in lines:
#             line_len = len(line)
            
#             if current_size + line_len > CHUNK_MAX_SIZE and current_chunk:
#                 chunk_text = ' '.join(current_chunk)
#                 chunk_text = re.sub(r'\s+', ' ', chunk_text).strip()
#                 raw_chunks.append(chunk_text)
#                 current_chunk = []
#                 current_size = 0
            
#             current_chunk.append(line)
#             current_size += line_len
            
#             if current_size >= CHUNK_MAX_SIZE * 0.6:
#                 chunk_text = ' '.join(current_chunk)
#                 chunk_text = re.sub(r'\s+', ' ', chunk_text).strip()
#                 raw_chunks.append(chunk_text)
#                 current_chunk = []
#                 current_size = 0
        
#         if current_chunk:
#             chunk_text = ' '.join(current_chunk)
#             chunk_text = re.sub(r'\s+', ' ', chunk_text).strip()
#             raw_chunks.append(chunk_text)
    
#     # Merge small chunks
#     final_chunks = merge_small_chunks(raw_chunks)
    
#     # Create Chunk objects
#     chunks = []
#     for i, text in enumerate(final_chunks):
#         if text and len(text) > 50:
#             chunks.append(Chunk(
#                 text=text,
#                 metadata={
#                     "source": source,
#                     "chunk_index": i,
#                     "total_chunks": len(final_chunks),
#                     "char_count": len(text)
#                 },
#                 chunk_index=i
#             ))
    
#     return chunks

# # Chunk all documents
# print("‚úÇÔ∏è  Chunking documents...")
# all_chunks = []

# for doc in documents:
#     chunks = chunk_document(doc.content, doc.source)
#     all_chunks.extend(chunks)
#     # print(f"   ‚úì {doc.metadata['filename']}: {len(chunks)} chunks")

# print(f"\n‚úÖ Created {len(all_chunks)} total chunks!")

# # Preview first chunk
# if all_chunks:
#     print(f"\nüìù Sample chunk (first 200 chars):")
#     # print(f"   '{all_chunks[0]}...'")

In [17]:
# =============================================================================
# üî• ADVANCED CHUNKING ENGINE - THE CHUNKING KING
# =============================================================================

import re
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple
from collections import defaultdict

@dataclass
class Chunk:
    """Represents a text chunk with rich metadata."""
    text: str
    metadata: Dict
    chunk_index: int
    parent_text: Optional[str] = None  # For parent-child chunking
    neighbors: List[int] = field(default_factory=list)  # Adjacent chunk indices

# =============================================================================
# UTILITY FUNCTIONS
# =============================================================================

def clean_text(text: str) -> str:
    """Clean and normalize text."""
    text = re.sub(r'\b\d+:\d+\b', '', text)  # Remove timestamps
    text = re.sub(r'\n\s*\d+\s*\n', '\n', text)  # Remove page numbers
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    text = re.sub(r'\s+', ' ', text)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    v1, v2 = np.array(vec1), np.array(vec2)
    norm1, norm2 = np.linalg.norm(v1), np.linalg.norm(v2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return float(np.dot(v1, v2) / (norm1 * norm2))

# =============================================================================
# CHUNKING STRATEGIES
# =============================================================================

def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """
    Strategy 1: Fixed-size chunking with overlap.
    Simple but fast. Good for uniform content.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        
        # Try to break at sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.5:
                end = start + last_period + 1
                chunk = text[start:end]
        
        chunks.append(chunk.strip())
        start = end - overlap  # Overlap for context continuity
    
    return [c for c in chunks if len(c) > 50]

def chunk_sentence_based(text: str, min_size: int = 200, max_size: int = 800) -> List[str]:
    """
    Strategy 2: Sentence-based chunking.
    Respects sentence boundaries. Good for essays/books.
    """
    sentences = split_into_sentences(text)
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for sentence in sentences:
        sentence_len = len(sentence)
        
        # If adding this sentence exceeds max, save current chunk
        if current_size + sentence_len > max_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_size = 0
        
        current_chunk.append(sentence)
        current_size += sentence_len + 1
        
        # If chunk is large enough and ends with punctuation, save it
        if current_size >= min_size and sentence.endswith(('.', '!', '?')):
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_size = 0
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return [c for c in chunks if len(c) > 50]

def chunk_by_speaker(text: str) -> List[str]:
    """
    Strategy 3: Speaker-based chunking for dialogues.
    Keeps each speaker's complete thought together.
    Best for Krishnamurti Q&A format!
    """
    # Pattern to detect speaker turns
    speaker_pattern = r'(Questioner\s*:|Krishnamurti\s*:|Q\s*:|K\s*:)'
    
    # Split by speaker
    parts = re.split(speaker_pattern, text)
    
    chunks = []
    current_speaker = ""
    
    for i, part in enumerate(parts):
        if re.match(speaker_pattern, part):
            current_speaker = part
        elif part.strip():
            # Combine speaker label with their content
            chunk = f"{current_speaker} {part.strip()}"
            chunks.append(chunk)
    
    return [c for c in chunks if len(c) > 50]

def chunk_by_paragraph(text: str, max_size: int = 1000) -> List[str]:
    """
    Strategy 4: Paragraph-based chunking.
    Good for structured documents with clear paragraphs.
    """
    paragraphs = text.split('\n\n')
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for para in paragraphs:
        para_len = len(para)
        
        if current_size + para_len > max_size and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = []
            current_size = 0
        
        current_chunk.append(para)
        current_size += para_len
    
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    
    return [c for c in chunks if len(c) > 50]

def chunk_semantic(text: str, threshold: float = 0.75, min_size: int = 200, max_size: int = 1000) -> List[str]:
    """
    Strategy 5: Semantic chunking using embeddings.
    Splits when topic changes significantly.
    Most intelligent but slowest!
    """
    sentences = split_into_sentences(text)
    
    if len(sentences) < 2:
        return [text] if len(text) > 50 else []
    
    print("   üß† Computing sentence embeddings for semantic chunking...")
    
    # Embed sentences in batches
    embeddings = []
    batch_size = 20
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        batch_embeddings = [embed_text(s) for s in batch]
        embeddings.extend(batch_embeddings)
    
    # Find split points where similarity drops
    chunks = []
    current_chunk = [sentences[0]]
    current_size = len(sentences[0])
    
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        sentence_len = len(sentences[i])
        
        # Split if: topic changed AND chunk is big enough
        should_split = (
            similarity < threshold and 
            current_size >= min_size
        ) or (current_size + sentence_len > max_size)
        
        if should_split and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
            current_size = sentence_len
        else:
            current_chunk.append(sentences[i])
            current_size += sentence_len
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return [c for c in chunks if len(c) > 50]

def chunk_recursive(text: str, max_size: int = 800) -> List[str]:
    """
    Strategy 6: Recursive chunking.
    Tries multiple delimiters in order: \n\n ‚Üí \n ‚Üí . ‚Üí space
    Good for mixed content.
    """
    separators = ['\n\n', '\n', '. ', ' ']
    
    def split_recursive(text: str, separators: List[str]) -> List[str]:
        if not separators or len(text) <= max_size:
            return [text] if text.strip() else []
        
        separator = separators[0]
        remaining_separators = separators[1:]
        
        parts = text.split(separator)
        
        chunks = []
        current = []
        current_len = 0
        
        for part in parts:
            part_len = len(part) + len(separator)
            
            if current_len + part_len > max_size and current:
                chunk_text = separator.join(current)
                # If still too big, recurse with next separator
                if len(chunk_text) > max_size:
                    chunks.extend(split_recursive(chunk_text, remaining_separators))
                else:
                    chunks.append(chunk_text)
                current = []
                current_len = 0
            
            current.append(part)
            current_len += part_len
        
        if current:
            chunk_text = separator.join(current)
            if len(chunk_text) > max_size:
                chunks.extend(split_recursive(chunk_text, remaining_separators))
            else:
                chunks.append(chunk_text)
        
        return chunks
    
    return [c.strip() for c in split_recursive(text, separators) if len(c.strip()) > 50]

# =============================================================================
# OVERLAP ADDING
# =============================================================================

def add_overlap(chunks: List[str], overlap_chars: int = 100) -> List[str]:
    """
    Add overlap between consecutive chunks.
    Ensures context is preserved at boundaries.
    """
    if len(chunks) < 2:
        return chunks
    
    overlapped = [chunks[0]]
    
    for i in range(1, len(chunks)):
        # Get last N chars from previous chunk
        prev_end = chunks[i-1][-overlap_chars:] if len(chunks[i-1]) > overlap_chars else chunks[i-1]
        # Prepend to current chunk
        overlapped.append(f"{prev_end}... {chunks[i]}")
    
    return overlapped

# =============================================================================
# PARENT-CHILD CHUNKING
# =============================================================================

def chunk_parent_child(text: str, parent_size: int = 2000, child_size: int = 500) -> Tuple[List[str], List[str]]:
    """
    Strategy 7: Parent-Child chunking.
    Creates large parent chunks and small child chunks.
    Search on children, return parent for context!
    """
    # Create parent chunks
    parents = chunk_fixed_size(text, chunk_size=parent_size, overlap=200)
    
    # Create child chunks from each parent
    all_children = []
    child_to_parent = {}
    
    for parent_idx, parent in enumerate(parents):
        children = chunk_sentence_based(parent, min_size=200, max_size=child_size)
        for child in children:
            child_idx = len(all_children)
            all_children.append(child)
            child_to_parent[child_idx] = parent_idx
    
    return parents, all_children, child_to_parent

# =============================================================================
# MERGE SMALL CHUNKS
# =============================================================================

def merge_small_chunks(chunks: List[str], min_size: int = 200) -> List[str]:
    """Merge chunks that are too small with neighbors."""
    if not chunks:
        return []
    
    merged = []
    current = chunks[0]
    
    for chunk in chunks[1:]:
        if len(current) < min_size:
            current += ' ' + chunk
        else:
            merged.append(current)
            current = chunk
    
    merged.append(current)
    return merged

# =============================================================================
# üî• THE CHUNKING KING - MAIN FUNCTION
# =============================================================================

def advanced_chunk_document(
    content: str, 
    source: str,
    strategy: str = "auto",  # auto, semantic, speaker, sentence, recursive, fixed
    use_overlap: bool = True,
    overlap_chars: int = 100,
    min_chunk_size: int = 200,
    max_chunk_size: int = 800,
    semantic_threshold: float = 0.75,
    verbose: bool = True
) -> List[Chunk]:
    """
    üî• Advanced Document Chunking with multiple strategies!
    
    Strategies:
    - auto: Automatically selects best strategy based on content
    - semantic: Uses embeddings to detect topic changes
    - speaker: Splits by speaker (for Q&A/dialogues)
    - sentence: Respects sentence boundaries
    - recursive: Tries multiple delimiters
    - fixed: Fixed-size with overlap
    
    Features:
    - Smart strategy selection
    - Optional overlap for context
    - Metadata with neighbors
    - Quality filtering
    """
    
    if verbose:
        print(f"\n{'='*60}")
        print(f"‚úÇÔ∏è ADVANCED CHUNKING: {source}")
        print(f"{'='*60}")
    
    # Clean the content
    content = re.sub(r'\b\d+:\d+\b', '', content)  # Remove timestamps
    cleaned = clean_text(content)
    
    # =========================================================================
    # AUTO-DETECT BEST STRATEGY
    # =========================================================================
    if strategy == "auto":
        # Check for speaker patterns (Q&A format)
        has_speakers = bool(re.search(r'(Questioner\s*:|Krishnamurti\s*:|Q\s*:|K\s*:)', content))
        
        # Check for punctuation (well-formatted text)
        has_punctuation = sum(content.count(p) for p in ['.', '!', '?']) > 10
        
        # Check for paragraphs
        has_paragraphs = content.count('\n\n') > 5
        
        if has_speakers:
            strategy = "speaker"
            if verbose:
                print(f"üìù Auto-detected: Speaker-based (Q&A format)")
        elif has_punctuation and len(cleaned) > 5000:
            strategy = "semantic"
            if verbose:
                print(f"üìù Auto-detected: Semantic (long well-formatted text)")
        elif has_paragraphs:
            strategy = "recursive"
            if verbose:
                print(f"üìù Auto-detected: Recursive (structured paragraphs)")
        else:
            strategy = "sentence"
            if verbose:
                print(f"üìù Auto-detected: Sentence-based (default)")
    
    # =========================================================================
    # APPLY SELECTED STRATEGY
    # =========================================================================
    if verbose:
        print(f"üîß Using strategy: {strategy}")
    
    if strategy == "semantic":
        raw_chunks = chunk_semantic(cleaned, threshold=semantic_threshold, 
                                    min_size=min_chunk_size, max_size=max_chunk_size)
    elif strategy == "speaker":
        raw_chunks = chunk_by_speaker(content)  # Use original to preserve speaker labels
        # Further split if too long
        split_chunks = []
        for chunk in raw_chunks:
            if len(chunk) > max_chunk_size:
                split_chunks.extend(chunk_sentence_based(chunk, min_chunk_size, max_chunk_size))
            else:
                split_chunks.append(chunk)
        raw_chunks = split_chunks
    elif strategy == "sentence":
        raw_chunks = chunk_sentence_based(cleaned, min_chunk_size, max_chunk_size)
    elif strategy == "recursive":
        raw_chunks = chunk_recursive(cleaned, max_chunk_size)
    elif strategy == "fixed":
        raw_chunks = chunk_fixed_size(cleaned, max_chunk_size, overlap_chars)
    elif strategy == "paragraph":
        raw_chunks = chunk_by_paragraph(cleaned, max_chunk_size)
    else:
        raw_chunks = chunk_sentence_based(cleaned, min_chunk_size, max_chunk_size)
    
    if verbose:
        print(f"   ‚úì Created {len(raw_chunks)} raw chunks")
    
    # =========================================================================
    # MERGE SMALL CHUNKS
    # =========================================================================
    merged_chunks = merge_small_chunks(raw_chunks, min_chunk_size)
    
    if verbose and len(merged_chunks) != len(raw_chunks):
        print(f"   ‚úì Merged to {len(merged_chunks)} chunks")
    
    # =========================================================================
    # ADD OVERLAP (Optional)
    # =========================================================================
    if use_overlap and strategy != "fixed":  # Fixed already has overlap
        final_texts = add_overlap(merged_chunks, overlap_chars)
        if verbose:
            print(f"   ‚úì Added {overlap_chars} char overlap")
    else:
        final_texts = merged_chunks
    
    # =========================================================================
    # CREATE CHUNK OBJECTS WITH METADATA
    # =========================================================================
    chunks = []
    for i, text in enumerate(final_texts):
        if text and len(text) > 50:
            # Find neighbors
            neighbors = []
            if i > 0:
                neighbors.append(i - 1)
            if i < len(final_texts) - 1:
                neighbors.append(i + 1)
            
            chunks.append(Chunk(
                text=text,
                metadata={
                    "source": source,
                    "chunk_index": i,
                    "total_chunks": len(final_texts),
                    "char_count": len(text),
                    "strategy": strategy,
                    "has_overlap": use_overlap
                },
                chunk_index=i,
                neighbors=neighbors
            ))
    
    if verbose:
        avg_size = sum(len(c.text) for c in chunks) / len(chunks) if chunks else 0
        print(f"   ‚úì Final: {len(chunks)} chunks (avg {avg_size:.0f} chars)")
        print(f"{'='*60}")
    
    return chunks

# =============================================================================
# BATCH DOCUMENT PROCESSING
# =============================================================================

def chunk_all_documents(
    documents: List,
    strategy: str = "auto",
    use_overlap: bool = True,
    verbose: bool = True
) -> List[Chunk]:
    """Process multiple documents with advanced chunking."""
    
    print("\n" + "üî•" * 20)
    print("CHUNKING KING - Processing Documents")
    print("üî•" * 20)
    
    all_chunks = []
    
    for doc in documents:
        chunks = advanced_chunk_document(
            content=doc.content,
            source=doc.source,
            strategy=strategy,
            use_overlap=use_overlap,
            verbose=verbose
        )
        all_chunks.extend(chunks)
        print(f"   ‚úì {doc.metadata.get('filename', 'Unknown')}: {len(chunks)} chunks")
    
    print(f"\n{'='*60}")
    print(f"‚úÖ TOTAL: {len(all_chunks)} chunks from {len(documents)} documents")
    print(f"{'='*60}")
    
    return all_chunks

# =============================================================================
# RUN CHUNKING
# =============================================================================

print("\nüî• CHUNKING KING is ready!")
print("   ‚úÖ Semantic chunking (embedding-based)")
print("   ‚úÖ Speaker-based (for Q&A)")
print("   ‚úÖ Sentence-based")
print("   ‚úÖ Recursive")
print("   ‚úÖ Fixed with overlap")
print("   ‚úÖ Auto-detection")

# Chunk documents
all_chunks = chunk_all_documents(documents, strategy="auto", use_overlap=True)

# Show sample
if all_chunks:
    print(f"\nüìù Sample chunk:")
    print(f"   Strategy: {all_chunks[0].metadata.get('strategy')}")
    print(f"   Size: {len(all_chunks[0].text)} chars")
    print(f"   Text: '{all_chunks[0].text[:200]}...'")


üî• CHUNKING KING is ready!
   ‚úÖ Semantic chunking (embedding-based)
   ‚úÖ Speaker-based (for Q&A)
   ‚úÖ Sentence-based
   ‚úÖ Recursive
   ‚úÖ Fixed with overlap
   ‚úÖ Auto-detection

üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•
CHUNKING KING - Processing Documents
üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•üî•

‚úÇÔ∏è ADVANCED CHUNKING: ./documents/jd1.pdf
üìù Auto-detected: Speaker-based (Q&A format)
üîß Using strategy: speaker
   ‚úì Created 110 raw chunks
   ‚úì Merged to 96 chunks
   ‚úì Added 100 char overlap
   ‚úì Final: 96 chunks (avg 407 chars)
   ‚úì jd1.pdf: 96 chunks

‚úÖ TOTAL: 96 chunks from 1 documents

üìù Sample chunk:
   Strategy: speaker
   Size: 284 chars
   Text: ' J. Krishnamurti's Fifth Public Discussion in London, 1965----- Krishnamurti: If I may, I‚Äôd like to go on with what we were talking the other day; that is, if you‚Äôre not bored with what we have talked...'


---
## Cell 6: Generate Embeddings
Convert text chunks to vector embeddings using OpenAI.

In [18]:
from openai import OpenAI
from typing import List

# Initialize OpenAI client
client = OpenAI()

def embed_text(text: str) -> List[float]:
    """Embed a single text string."""
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=text
    )
    return response.data[0].embedding

def embed_texts(texts: List[str], batch_size: int = 50) -> List[List[float]]:
    """Embed multiple texts with batching."""
    all_embeddings = []
    total = len(texts)
    
    for i in range(0, total, batch_size):
        batch = texts[i:i + batch_size]
        
        response = client.embeddings.create(
            model=EMBED_MODEL,
            input=batch
        )
        
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        
        processed = min(i + batch_size, total)
        print(f"   ‚úì Embedded {processed}/{total} chunks")
    
    return all_embeddings

# Generate embeddings for all chunks
print("üß† Generating embeddings...")
texts = [chunk.text for chunk in all_chunks]
embeddings = embed_texts(texts)

print(f"\n‚úÖ Generated {len(embeddings)} embeddings!")
print(f"   Embedding dimension: {len(embeddings[0])}")

üß† Generating embeddings...
   ‚úì Embedded 50/96 chunks
   ‚úì Embedded 96/96 chunks

‚úÖ Generated 96 embeddings!
   Embedding dimension: 3072


---
## Cell 7: Store in Qdrant Vector Database
Save embeddings to local Qdrant database.

In [19]:
import shutil
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, 
    Distance, 
    PointStruct,
    ScalarQuantizationConfig,
    ScalarType
)
# üßπ Clean up any existing database (prevents lock errors on re-run)
if os.path.exists(QDRANT_PATH):
    shutil.rmtree(QDRANT_PATH)
    print(f"üßπ Cleaned up existing database at: {QDRANT_PATH}")
# Create local Qdrant client (file-based, no server needed)
os.makedirs(QDRANT_PATH, exist_ok=True)
qdrant_client = QdrantClient(path=QDRANT_PATH)
print(f"üì¶ Created local Qdrant at: {QDRANT_PATH}")

# Delete existing collection if it exists
try:
    qdrant_client.delete_collection(COLLECTION_NAME)
    print(f"   üóëÔ∏è  Deleted existing collection: {COLLECTION_NAME}")
except:
    pass

# Create collection
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=EMBED_DIMENSION,
        distance=Distance.COSINE
    ),
    quantization_config=ScalarQuantizationConfig(
        type=ScalarType.INT8,
        always_ram=True
    )
)
print(f"   ‚úì Created collection: {COLLECTION_NAME}")

# Prepare points
points = []
for i, (chunk, embedding) in enumerate(zip(all_chunks, embeddings)):
    points.append(PointStruct(
        id=i,
        vector=embedding,
        payload={
            "text": chunk.text,
            "source": chunk.metadata.get("source", ""),
            "chunk_index": chunk.chunk_index,
            "metadata": chunk.metadata
        }
    ))

# Upsert points in batches
batch_size = 100
for i in range(0, len(points), batch_size):
    batch = points[i:i + batch_size]
    qdrant_client.upsert(
        collection_name=COLLECTION_NAME,
        points=batch
    )

# Get collection info
info = qdrant_client.get_collection(COLLECTION_NAME)
print(f"\n‚úÖ Stored {info.points_count} vectors in Qdrant!")

üßπ Cleaned up existing database at: ./qdrant_db
üì¶ Created local Qdrant at: ./qdrant_db
   üóëÔ∏è  Deleted existing collection: rag_collection
   ‚úì Created collection: rag_collection

‚úÖ Stored 96 vectors in Qdrant!


---
## Cell 8: Build BM25 Index (Keyword Search)
Create a BM25 index for hybrid search.

In [20]:
import pickle
import json
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> List[str]:
    """Simple tokenization for BM25."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    return [t for t in tokens if len(t) > 2]

# Build BM25 index
print("üìö Building BM25 index...")

corpus_texts = [chunk.text for chunk in all_chunks]
tokenized_corpus = [tokenize(text) for text in corpus_texts]
bm25_index = BM25Okapi(tokenized_corpus)

# Save index and corpus
BM25_INDEX_PATH = os.path.join(QDRANT_PATH, "bm25_index.pkl")
BM25_CORPUS_PATH = os.path.join(QDRANT_PATH, "bm25_corpus.json")

with open(BM25_INDEX_PATH, 'wb') as f:
    pickle.dump(bm25_index, f)

with open(BM25_CORPUS_PATH, 'w') as f:
    json.dump(corpus_texts, f)

print(f"\n‚úÖ BM25 index built with {len(corpus_texts)} documents!")

üìö Building BM25 index...

‚úÖ BM25 index built with 96 documents!


---
## Cell 9: Query Engine (Hybrid Search)
Define functions for searching and retrieving context.

In [21]:
# import numpy as np
# from typing import List, Dict, Tuple

# def normalize_scores(scores: List[float]) -> List[float]:
#     """Normalize scores to 0-1 range."""
#     if not scores:
#         return []
#     min_score = min(scores)
#     max_score = max(scores)
#     if max_score == min_score:
#         return [1.0] * len(scores)
#     return [(s - min_score) / (max_score - min_score) for s in scores]

# def search_vectors(query_embedding: List[float], limit: int = 5) -> List[Dict]:
#     """Search vectors in Qdrant."""
#     results = qdrant_client.query_points(
#         collection_name=COLLECTION_NAME,
#         query=query_embedding,
#         limit=limit,
#     )
    
#     formatted = []
#     for point in results.points:
#         formatted.append({
#             "text": point.payload.get("text", ""),
#             "score": point.score,
#             "source": point.payload.get("source", ""),
#             "metadata": point.payload.get("metadata", {}),
#         })
#     return formatted

# def search_bm25(query: str, top_k: int = 10) -> List[Tuple[int, float]]:
#     """Search using BM25."""
#     query_tokens = tokenize(query)
#     if not query_tokens:
#         return []
    
#     scores = bm25_index.get_scores(query_tokens)
#     top_indices = np.argsort(scores)[::-1][:top_k]
    
#     results = []
#     for idx in top_indices:
#         if scores[idx] > 0:
#             results.append((int(idx), float(scores[idx])))
#     return results

# def hybrid_search(query: str, top_k: int = 5) -> List[Dict]:
#     """Perform hybrid search combining vector and BM25."""
#     print(f"\nüîç Searching for: '{query}'")
#     print("-" * 50)
    
#     # Vector search
#     query_embedding = embed_text(query)
#     vector_results = search_vectors(query_embedding, limit=top_k * 2)
#     print(f"   Vector search: {len(vector_results)} results")
    
#     # BM25 search
#     bm25_results = search_bm25(query, top_k=top_k * 2)
#     print(f"   BM25 search: {len(bm25_results)} results")
    
#     # Combine results
#     text_scores = {}
#     text_metadata = {}
    
#     # Process vector results
#     vector_scores_list = [r["score"] for r in vector_results]
#     normalized_vector = normalize_scores(vector_scores_list)
    
#     for i, result in enumerate(vector_results):
#         text = result["text"]
#         score = normalized_vector[i] if normalized_vector else 0
#         text_scores[text] = {"vector": score, "bm25": 0, "combined": 0}
#         text_metadata[text] = {
#             "source": result.get("source", ""),
#             "metadata": result.get("metadata", {})
#         }
    
#     # Process BM25 results
#     bm25_scores_raw = [score for _, score in bm25_results]
#     normalized_bm25 = normalize_scores(bm25_scores_raw)
    
#     for i, (idx, _) in enumerate(bm25_results):
#         text = corpus_texts[idx]
#         score = normalized_bm25[i] if normalized_bm25 else 0
#         if text in text_scores:
#             text_scores[text]["bm25"] = score
#         else:
#             text_scores[text] = {"vector": 0, "bm25": score, "combined": 0}
#             text_metadata[text] = {"source": "", "metadata": {}}
    
#     # Calculate combined scores
#     for text in text_scores:
#         vs = text_scores[text]["vector"]
#         bs = text_scores[text]["bm25"]
#         text_scores[text]["combined"] = (vs * VECTOR_WEIGHT) + (bs * BM25_WEIGHT)
    
#     # Sort and format results
#     sorted_texts = sorted(
#         text_scores.keys(),
#         key=lambda t: text_scores[t]["combined"],
#         reverse=True
#     )
    
#     results = []
#     for text in sorted_texts[:top_k]:
#         results.append({
#             "text": text,
#             "score": text_scores[text]["combined"],
#             **text_metadata.get(text, {})
#         })
    
#     print(f"   Combined: {len(results)} results")
#     print("-" * 50)
    
#     return results

# def get_context(question: str, k: int = 3) -> str:
#     """Get formatted context for RAG."""
#     results = hybrid_search(question, top_k=k)
    
#     if not results:
#         return "No relevant context found."
    
#     context_parts = []
#     for i, result in enumerate(results, 1):
#         source = os.path.basename(result.get("source", "Unknown"))
#         text = result.get("text", "")
#         score = result.get("score", 0)
#         context_parts.append(f"[Source {i}: {source} | Score: {score:.2f}]\n{text}")
    
#     return "\n\n---\n\n".join(context_parts)

# print("‚úÖ Query engine ready!")

In [22]:
# =============================================================================
# üî• ADVANCED HYBRID SEARCH ENGINE - THE SEARCH KING
# =============================================================================

import numpy as np
from typing import List, Dict, Tuple, Optional
from collections import defaultdict

# Try to import reranker (optional)
try:
    from sentence_transformers import CrossEncoder
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    HAS_RERANKER = True
    print("‚úÖ Reranker loaded!")
except:
    HAS_RERANKER = False
    print("‚ö†Ô∏è Reranker not available (install sentence-transformers for better results)")

# =============================================================================
# CORE UTILITIES
# =============================================================================

def normalize_scores(scores: List[float]) -> List[float]:
    """Normalize scores to 0-1 range."""
    if not scores:
        return []
    min_score, max_score = min(scores), max(scores)
    if max_score == min_score:
        return [1.0] * len(scores)
    return [(s - min_score) / (max_score - min_score) for s in scores]

def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    v1, v2 = np.array(vec1), np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# =============================================================================
# QUERY ENHANCEMENT
# =============================================================================

def expand_query_simple(query: str) -> List[str]:
    """Simple query expansion using common patterns."""
    expansions = [query]
    
    # Add variations
    if "what is" in query.lower():
        expansions.append(query.lower().replace("what is", "explain"))
        expansions.append(query.lower().replace("what is", "describe"))
    
    if "how" in query.lower():
        expansions.append(query.lower().replace("how", "what is the way"))
    
    # Add keywords
    keywords = ["Krishnamurti", "meditation", "awareness", "thought", "observer"]
    for kw in keywords:
        if kw.lower() in query.lower():
            expansions.append(kw)
    
    return list(set(expansions))

def hyde_search(query: str) -> List[float]:
    """
    HyDE: Hypothetical Document Embeddings
    Instead of embedding the question, embed a hypothetical answer.
    """
    # Create a hypothetical answer (simple version without LLM)
    hypothetical = f"""
    Krishnamurti addresses this question about {query.lower().replace('?', '')}.
    He speaks about the nature of awareness and the observation of the mind.
    The key insight is that true understanding comes not from analysis or method,
    but from direct perception without the interference of thought.
    """
    return embed_text(hypothetical)

# =============================================================================
# SEARCH FUNCTIONS
# =============================================================================

def search_vectors(query_embedding: List[float], limit: int = 10) -> List[Dict]:
    """Search vectors in Qdrant."""
    results = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_embedding,
        limit=limit,
    )
    
    formatted = []
    for point in results.points:
        formatted.append({
            "text": point.payload.get("text", ""),
            "score": point.score,
            "source": point.payload.get("source", ""),
            "metadata": point.payload.get("metadata", {}),
            "chunk_index": point.payload.get("metadata", {}).get("chunk_index", 0)
        })
    return formatted

def search_bm25(query: str, top_k: int = 10) -> List[Tuple[int, float]]:
    """Search using BM25."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    
    scores = bm25_index.get_scores(query_tokens)
    top_indices = np.argsort(scores)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        if scores[idx] > 0:
            results.append((int(idx), float(scores[idx])))
    return results

# =============================================================================
# ADVANCED FUSION & RANKING
# =============================================================================

def reciprocal_rank_fusion(results_lists: List[List[Dict]], k: int = 60) -> Dict[str, float]:
    """
    Reciprocal Rank Fusion - combines multiple result lists.
    Better than simple score averaging!
    """
    fused_scores = defaultdict(float)
    
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc["text"]
            fused_scores[doc_id] += 1.0 / (k + rank + 1)
    
    return dict(fused_scores)

def rerank_results(query: str, results: List[Dict], top_k: int = 5) -> List[Dict]:
    """Rerank results using cross-encoder (if available)."""
    if not HAS_RERANKER or not results:
        return results[:top_k]
    
    pairs = [[query, r["text"]] for r in results]
    scores = reranker.predict(pairs)
    
    # Combine with original scores (60% rerank, 40% original)
    for i, result in enumerate(results):
        result["rerank_score"] = float(scores[i])
        result["final_score"] = 0.6 * float(scores[i]) + 0.4 * result.get("score", 0)
    
    ranked = sorted(results, key=lambda x: x["final_score"], reverse=True)
    return ranked[:top_k]

def get_neighbor_chunks(chunk_indices: List[int], window: int = 1) -> List[int]:
    """Get neighboring chunk indices for context."""
    neighbors = set()
    for idx in chunk_indices:
        for offset in range(-window, window + 1):
            neighbor_idx = idx + offset
            if 0 <= neighbor_idx < len(all_chunks):
                neighbors.add(neighbor_idx)
    return sorted(neighbors)

# =============================================================================
# üî• THE SEARCH KING - MAIN FUNCTION
# =============================================================================

def advanced_hybrid_search(
    query: str, 
    top_k: int = 5,
    use_hyde: bool = True,
    use_query_expansion: bool = True,
    use_reranking: bool = True,
    use_neighbors: bool = True,
    verbose: bool = True
) -> List[Dict]:
    """
    üî• Advanced Hybrid Search with all the bells and whistles!
    
    Features:
    - Query expansion (multiple query variations)
    - HyDE (hypothetical document embeddings)
    - Vector search (semantic)
    - BM25 search (keyword)
    - Reciprocal Rank Fusion
    - Cross-encoder reranking
    - Neighbor chunk retrieval
    """
    
    if verbose:
        print(f"\n{'='*60}")
        print(f"üîç ADVANCED SEARCH: '{query}'")
        print(f"{'='*60}")
    
    all_results = []
    text_metadata = {}
    
    # =========================================================================
    # STEP 1: Query Expansion
    # =========================================================================
    queries = [query]
    if use_query_expansion:
        queries = expand_query_simple(query)
        if verbose:
            print(f"üìù Query variations: {len(queries)}")
    
    # =========================================================================
    # STEP 2: Multi-Query Vector Search
    # =========================================================================
    for q in queries:
        # Standard embedding search
        q_embedding = embed_text(q)
        vector_results = search_vectors(q_embedding, limit=top_k * 2)
        all_results.append(vector_results)
        
        # Store metadata
        for r in vector_results:
            text_metadata[r["text"]] = {
                "source": r.get("source", ""),
                "metadata": r.get("metadata", {}),
                "chunk_index": r.get("chunk_index", 0)
            }
    
    if verbose:
        print(f"üß† Vector search: {sum(len(r) for r in all_results)} results")
    
    # =========================================================================
    # STEP 3: HyDE Search (Hypothetical Document Embeddings)
    # =========================================================================
    if use_hyde:
        hyde_embedding = hyde_search(query)
        hyde_results = search_vectors(hyde_embedding, limit=top_k * 2)
        all_results.append(hyde_results)
        
        for r in hyde_results:
            if r["text"] not in text_metadata:
                text_metadata[r["text"]] = {
                    "source": r.get("source", ""),
                    "metadata": r.get("metadata", {}),
                    "chunk_index": r.get("chunk_index", 0)
                }
        
        if verbose:
            print(f"üéØ HyDE search: {len(hyde_results)} results")
    
    # =========================================================================
    # STEP 4: BM25 Keyword Search
    # =========================================================================
    bm25_formatted = []
    for q in queries:
        bm25_results = search_bm25(q, top_k=top_k * 2)
        for idx, score in bm25_results:
            text = corpus_texts[idx]
            bm25_formatted.append({
                "text": text,
                "score": score,
                "chunk_index": idx
            })
            if text not in text_metadata:
                text_metadata[text] = {
                    "source": all_chunks[idx].metadata.get("source", "") if idx < len(all_chunks) else "",
                    "metadata": all_chunks[idx].metadata if idx < len(all_chunks) else {},
                    "chunk_index": idx
                }
    
    all_results.append(bm25_formatted)
    
    if verbose:
        print(f"üî§ BM25 search: {len(bm25_formatted)} results")
    
    # =========================================================================
    # STEP 5: Reciprocal Rank Fusion
    # =========================================================================
    fused_scores = reciprocal_rank_fusion(all_results)
    
    # Create combined results
    combined_results = []
    for text, rrf_score in fused_scores.items():
        combined_results.append({
            "text": text,
            "score": rrf_score,
            **text_metadata.get(text, {})
        })
    
    # Sort by RRF score
    combined_results = sorted(combined_results, key=lambda x: x["score"], reverse=True)
    
    if verbose:
        print(f"üîó Fusion: {len(combined_results)} unique results")
    
    # =========================================================================
    # STEP 6: Neighbor Chunk Retrieval
    # =========================================================================
    if use_neighbors and combined_results:
        # Get top chunk indices
        top_indices = [r.get("chunk_index", 0) for r in combined_results[:top_k]]
        neighbor_indices = get_neighbor_chunks(top_indices, window=1)
        
        # Add neighbor chunks if not already in results
        existing_texts = {r["text"] for r in combined_results}
        for idx in neighbor_indices:
            if idx < len(all_chunks):
                chunk = all_chunks[idx]
                if chunk.text not in existing_texts:
                    combined_results.append({
                        "text": chunk.text,
                        "score": 0.1,  # Lower score for neighbors
                        "source": chunk.metadata.get("source", ""),
                        "metadata": chunk.metadata,
                        "chunk_index": idx,
                        "is_neighbor": True
                    })
        
        if verbose:
            print(f"üìç Added {len(neighbor_indices) - len(top_indices)} neighbor chunks")
    
    # =========================================================================
    # STEP 7: Reranking (if available)
    # =========================================================================
    if use_reranking and HAS_RERANKER:
        combined_results = rerank_results(query, combined_results, top_k=top_k * 2)
        if verbose:
            print(f"‚ö° Reranked with cross-encoder")
    
    # Final top-k
    final_results = combined_results[:top_k]
    
    if verbose:
        print(f"{'='*60}")
        print(f"‚úÖ Returning top {len(final_results)} results")
        print(f"{'='*60}")
    
    return final_results

# =============================================================================
# CONTEXT RETRIEVAL
# =============================================================================

def get_context_advanced(question: str, k: int = 3, verbose: bool = True) -> str:
    """Get formatted context using advanced search."""
    results = advanced_hybrid_search(
        question, 
        top_k=k,
        use_hyde=True,
        use_query_expansion=True,
        use_reranking=False,  # ‚Üê Change this to False
        use_neighbors=True,
        verbose=verbose
    )
    
    if not results:
        return "No relevant context found."
    
    context_parts = []
    for i, result in enumerate(results, 1):
        source = os.path.basename(result.get("source", "Unknown"))
        text = result.get("text", "")
        score = result.get("score", 0)
        neighbor_tag = " [NEIGHBOR]" if result.get("is_neighbor") else ""
        context_parts.append(f"[Source {i}: {source} | Score: {score:.3f}{neighbor_tag}]\n{text}")
    
    return "\n\n---\n\n".join(context_parts)

# Keep original functions as aliases
hybrid_search = advanced_hybrid_search
get_context = get_context_advanced

print("\nüî• SEARCH KING is ready!")
print(f"   ‚úÖ Query expansion: ON")
print(f"   ‚úÖ HyDE search: ON")
print(f"   ‚úÖ Hybrid (Vector + BM25): ON")
print(f"   ‚úÖ Reciprocal Rank Fusion: ON")
print(f"   ‚úÖ Neighbor chunks: ON")
print(f"   {'‚úÖ' if HAS_RERANKER else '‚ö†Ô∏è'} Reranking: {'ON' if HAS_RERANKER else 'OFF (install sentence-transformers)'}")

Loading weights:   0%|          | 0/105 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: cross-encoder/ms-marco-MiniLM-L-6-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


‚úÖ Reranker loaded!

üî• SEARCH KING is ready!
   ‚úÖ Query expansion: ON
   ‚úÖ HyDE search: ON
   ‚úÖ Hybrid (Vector + BM25): ON
   ‚úÖ Reciprocal Rank Fusion: ON
   ‚úÖ Neighbor chunks: ON
   ‚úÖ Reranking: ON


---
## Cell 10: üéØ Test Your RAG System!
Enter a query to search your documents.

In [23]:
# ============================================
# üéØ ENTER YOUR QUERY HERE
# ============================================
query = "What is meditation?"  # <-- Change this to your question!

# Get context
context = get_context(query, k=3)

print("\n" + "=" * 60)
print("üìö RETRIEVED CONTEXT")
print("=" * 60)
print(context)
print("\n" + "=" * 60)


üîç ADVANCED SEARCH: 'What is meditation?'
üìù Query variations: 4
üß† Vector search: 24 results
üéØ HyDE search: 6 results
üî§ BM25 search: 24 results
üîó Fusion: 12 unique results
üìç Added 6 neighbor chunks
‚úÖ Returning top 3 results

üìö RETRIEVED CONTEXT
[Source 1: jd1.pdf | Score: 0.136]
So then how is ‚Äì I‚Äôm using the ‚Äòhow‚Äô merely as a question ‚Äì then how is one to find oneself in that?... Now, I think here comes the question of meditation. I‚Äôm not... we are not talking of meditation as a method ‚Äì you understand? ‚Äì therefore it‚Äôs not... it has nothing whatever to do with method because method is the ‚Äòhow‚Äô and we have pushed that aside as being inadequate, immature, juvenile.

---

[Source 2: jd1.pdf | Score: 0.120]
r to arrive at that quietness is called generally meditation, which of course is too childish and...... it‚Äôs too absurd. So... but yet I see the mind must be extraordinarily quiet because I know that any movement in any direction, at a

---
## üéâ Congratulations!

You have successfully built an end-to-end RAG pipeline!

### What You Learned:
1. **Loading** - Extract text from PDFs using pdfplumber
2. **Chunking** - Split text into semantic pieces
3. **Embedding** - Convert text to vectors using OpenAI
4. **Vector Storage** - Store in Qdrant database
5. **Hybrid Search** - Combine semantic + keyword search
6. **Retrieval** - Get relevant context for any query

### Next Steps:
- Try different queries in Cell 10
- Add more PDF files and rebuild the index
- Adjust `CHUNK_MIN_SIZE` and `CHUNK_MAX_SIZE` in Cell 2
- Experiment with `VECTOR_WEIGHT` and `BM25_WEIGHT`