# Advanced RAG Build: Semantic Chunking vs Naive Chunking Evaluation

This notebook implements and compares two RAG approaches:
1. **Baseline**: LangGraph RAG with Naive Retrieval (RecursiveCharacterTextSplitter)
2. **Advanced**: LangGraph RAG with Semantic Chunking + Naive Retrieval

## Evaluation Metrics (Ragas):
- Faithfulness
- Answer Relevancy 
- Context Precision
- Context Recall
- Answer Correctness

## Implementation Strategy:
- **Semantic Chunking**: Group semantically similar sentences using cosine similarity threshold
- **Greedy Approach**: Prioritize similar sentences up to maximum chunk size
- **Minimum**: Single sentence per chunk
- **Retrieval**: Naive retrieval for both approaches (no reranking)


## 1. Dependencies and Setup


In [3]:
import os
from getpass import getpass
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


In [4]:
# API Keys
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")


## 2. Data Preparation


In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

# Load the same data as original notebook
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

print(f"Loaded {len(docs)} documents")
print(f"Total characters: {sum(len(doc.page_content) for doc in docs):,}")


Loaded 269 documents
Total characters: 838,132


## 3. Synthetic Test Dataset Generation (Reusing Original Implementation)


In [6]:
# Set up models for dataset generation (same as original)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())


In [7]:
# Generate synthetic test dataset (same implementation as original)
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

print(f"Generated {len(dataset)} test samples")
dataset.to_pandas().head()


Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/30 [00:00<?, ?it/s]

Property 'summary' already exists in node '133739'. Skipping!
Property 'summary' already exists in node 'f5bb1e'. Skipping!
Property 'summary' already exists in node '1f6dcb'. Skipping!
Property 'summary' already exists in node 'e477f9'. Skipping!
Property 'summary' already exists in node '013c63'. Skipping!
Property 'summary' already exists in node '50457f'. Skipping!
Property 'summary' already exists in node '5b92f4'. Skipping!
Property 'summary' already exists in node 'df764a'. Skipping!
Property 'summary' already exists in node '1d2cbd'. Skipping!
Property 'summary' already exists in node '7c5bcf'. Skipping!
Property 'summary' already exists in node 'b9e953'. Skipping!
Property 'summary' already exists in node 'd391db'. Skipping!
Property 'summary' already exists in node '4d53db'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/42 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '4d53db'. Skipping!
Property 'summary_embedding' already exists in node '50457f'. Skipping!
Property 'summary_embedding' already exists in node '5b92f4'. Skipping!
Property 'summary_embedding' already exists in node 'd391db'. Skipping!
Property 'summary_embedding' already exists in node '1f6dcb'. Skipping!
Property 'summary_embedding' already exists in node 'df764a'. Skipping!
Property 'summary_embedding' already exists in node '013c63'. Skipping!
Property 'summary_embedding' already exists in node 'f5bb1e'. Skipping!
Property 'summary_embedding' already exists in node '133739'. Skipping!
Property 'summary_embedding' already exists in node '7c5bcf'. Skipping!
Property 'summary_embedding' already exists in node 'b9e953'. Skipping!
Property 'summary_embedding' already exists in node 'e477f9'. Skipping!
Property 'summary_embedding' already exists in node '1d2cbd'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

Generated 12 test samples


|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What department do you contact for approval of... | [Chapter 1 Academic Years, Academic Calendars,... | To request approval for a full academic year o... | single_hop_specifc_query_synthesizer |
| 1 | What details are covered in Volume 1 regarding... | [non-term (includes clock-hour calendars), or ... | Volume 1 covers requirements for determining f... | single_hop_specifc_query_synthesizer |
| 2 | What Volume 8, Chapter 3 say about clinical work? | [Inclusion of Clinical Work in a Standard Term... | Volume 8, Chapter 3 provides additional guidan... | single_hop_specifc_query_synthesizer |
| 3 | How do non-term characteristics affect the dis... | [Non-Term Characteristics A program that measu... | Non-term characteristics, such as programs tha... | single_hop_specifc_query_synthesizer |
| 4 | How do the disbursement rules for Pell Grants ... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |


## 4. Baseline RAG Implementation (Naive Chunking)


In [8]:
# Naive chunking using RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

naive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)
naive_split_documents = naive_text_splitter.split_documents(docs)

print(f"Naive chunking created {len(naive_split_documents)} chunks")
print(f"Average chunk length: {np.mean([len(doc.page_content) for doc in naive_split_documents]):.0f} characters")


Naive chunking created 1102 chunks
Average chunk length: 864 characters
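
For intuition about how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-size character windows that share an overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this hypothetical `sliding_window_chunks` helper does not attempt.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Hypothetical helper: fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# Tiny example so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(windows)
```

Each window repeats the last `chunk_overlap` characters of the previous one, which is why the total chunked text is longer than the source text.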


In [9]:
# Set up embeddings and vector store for baseline
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create in-memory vector store for baseline
client_baseline = QdrantClient(":memory:")
client_baseline.create_collection(
    collection_name="loan_data_baseline",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store_baseline = QdrantVectorStore(
    client=client_baseline,
    collection_name="loan_data_baseline",
    embedding=embeddings,
)

# Add documents to vector store
_ = vector_store_baseline.add_documents(documents=naive_split_documents)
retriever_baseline = vector_store_baseline.as_retriever(search_kwargs={"k": 5})

print("Baseline vector store created successfully!")


Baseline vector store created successfully!
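
The retriever above returns the `k` chunks whose embeddings score highest against the query embedding under cosine similarity. A minimal sketch of that ranking step, using hand-written toy vectors instead of real embeddings (the `top_k` helper, the chunk names, and the vectors are illustrative assumptions, not Qdrant's implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Rank chunk names by similarity to the query and keep the best k."""
    scored = [(cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy 3-dimensional "embeddings" (real ones here have 1536 dimensions)
toy_chunks = {
    "pell_grants": [0.9, 0.1, 0.0],
    "academic_calendar": [0.1, 0.9, 0.0],
    "clinical_work": [0.0, 0.2, 0.9],
}
ranked = top_k([1.0, 0.2, 0.0], toy_chunks, k=2)
print(ranked)
```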


In [10]:
# LangGraph implementation for baseline RAG
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate

# State definition
class State(TypedDict):
    question: str
    context: List[Document]
    response: str

# RAG prompt
RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# LLM for generation
llm = ChatOpenAI(model="gpt-4o-mini")

# Define nodes
def retrieve_baseline(state):
    retrieved_docs = retriever_baseline.invoke(state["question"])
    return {"context": retrieved_docs}

def generate(state):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
    response = llm.invoke(messages)
    return {"response": response.content}

# Build baseline graph
baseline_graph_builder = StateGraph(State).add_sequence([retrieve_baseline, generate])
baseline_graph_builder.add_edge(START, "retrieve_baseline")
baseline_graph = baseline_graph_builder.compile()

print("Baseline RAG graph created successfully!")


Baseline RAG graph created successfully!


## 5. Semantic Chunking Implementation


In [11]:
import re
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Configuration for semantic chunking
SIMILARITY_THRESHOLD = 0.7  # Cosine similarity threshold for grouping sentences
MAX_CHUNK_SIZE = 1000  # Maximum characters per chunk
MIN_CHUNK_SIZE = 1  # Minimum chunk size (single sentence)

# Load sentence transformer model for semantic similarity
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"Semantic chunking configuration:")
print(f"- Similarity threshold: {SIMILARITY_THRESHOLD}")
print(f"- Max chunk size: {MAX_CHUNK_SIZE} characters")
print(f"- Min chunk size: {MIN_CHUNK_SIZE} sentence(s)")


Semantic chunking configuration:
- Similarity threshold: 0.7
- Max chunk size: 1000 characters
- Min chunk size: 1 sentence(s)


In [12]:
def split_into_sentences(text):
    """Split text into sentences using regex."""
    # Simple sentence splitting - can be improved with NLTK or spaCy
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

def semantic_chunking(documents, similarity_threshold=SIMILARITY_THRESHOLD, max_chunk_size=MAX_CHUNK_SIZE):
    """
    Implement semantic chunking strategy:
    1. Split documents into sentences
    2. Group semantically similar sentences using cosine similarity
    3. Use greedy approach up to maximum chunk size
    4. Minimum chunk size is a single sentence
    """
    semantic_chunks = []
    
    for doc in documents:
        text = doc.page_content
        sentences = split_into_sentences(text)
        
        if not sentences:
            continue
            
        # Get sentence embeddings
        sentence_embeddings = sentence_model.encode(sentences)
        
        # Start with first sentence
        current_chunk_sentences = [sentences[0]]
        current_chunk_embeddings = [sentence_embeddings[0]]
        
        for i in range(1, len(sentences)):
            sentence = sentences[i]
            sentence_embedding = sentence_embeddings[i]
            
            # Calculate similarity with current chunk (average embedding)
            current_chunk_avg_embedding = np.mean(current_chunk_embeddings, axis=0).reshape(1, -1)
            sentence_embedding_reshaped = sentence_embedding.reshape(1, -1)
            similarity = cosine_similarity(current_chunk_avg_embedding, sentence_embedding_reshaped)[0][0]
            
            # Check if we should add to current chunk
            potential_chunk_text = ' '.join(current_chunk_sentences + [sentence])
            
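            # Greedy approach: keep adding sentences while the chunk fits within max size.
            # Note: (A or B) and B reduces to B, so the similarity term below never
            # changes the outcome; splits are driven by max_chunk_size alone.
            if (similarity >= similarity_threshold or len(potential_chunk_text) <= max_chunk_size) and len(potential_chunk_text) <= max_chunk_size: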
                current_chunk_sentences.append(sentence)
                current_chunk_embeddings.append(sentence_embedding)
            else:
                # Finalize current chunk and start new one
                chunk_text = ' '.join(current_chunk_sentences)
                if chunk_text.strip():
                    semantic_chunks.append({
                        'content': chunk_text,
                        'metadata': doc.metadata
                    })
                
                # Start new chunk with current sentence
                current_chunk_sentences = [sentence]
                current_chunk_embeddings = [sentence_embedding]
        
        # Add final chunk
        chunk_text = ' '.join(current_chunk_sentences)
        if chunk_text.strip():
            semantic_chunks.append({
                'content': chunk_text,
                'metadata': doc.metadata
            })
    
    return semantic_chunks

print("Semantic chunking function defined!")


Semantic chunking function defined!
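
As a point of comparison, here is a minimal sketch of a grouping rule in which a drop in similarity always starts a new chunk: a sentence joins the current chunk only if it is both similar enough to the chunk's mean embedding and fits within the size limit. The `group_by_similarity` helper, the example sentences, and the 2-D toy vectors are illustrative assumptions; this is a variant for experimentation, not the function defined above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def group_by_similarity(sentences, embeddings, threshold=0.7, max_chars=1000):
    """Hypothetical variant: split when similarity drops OR the size limit is hit."""
    if not sentences:
        return []
    chunks = []
    current, current_vecs = [sentences[0]], [embeddings[0]]
    for sentence, vec in zip(sentences[1:], embeddings[1:]):
        similar = cosine(mean_vector(current_vecs), vec) >= threshold
        fits = len(" ".join(current + [sentence])) <= max_chars
        if similar and fits:
            current.append(sentence)
            current_vecs.append(vec)
        else:
            # Close the current chunk and start a new one with this sentence
            chunks.append(" ".join(current))
            current, current_vecs = [sentence], [vec]
    chunks.append(" ".join(current))
    return chunks

# Toy data: the first two sentences point the same way, the third does not
toy_sentences = [
    "Pell Grants are disbursed by term.",
    "Disbursement follows payment periods.",
    "Clinical work may count toward a term.",
]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_chunks = group_by_similarity(toy_sentences, toy_embeddings)
print(toy_chunks)
```

With a rule like this, the threshold becomes a real tuning knob: raising it produces more, smaller chunks, while lowering it approaches purely size-based packing.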


In [13]:
# Apply semantic chunking to documents
print("Applying semantic chunking...")
semantic_chunk_data = semantic_chunking(docs)

# Convert to Document objects for compatibility
from langchain_core.documents import Document

semantic_split_documents = []
for chunk_data in semantic_chunk_data:
    doc = Document(
        page_content=chunk_data['content'],
        metadata=chunk_data['metadata']
    )
    semantic_split_documents.append(doc)

print(f"Semantic chunking created {len(semantic_split_documents)} chunks")
print(f"Average chunk length: {np.mean([len(doc.page_content) for doc in semantic_split_documents]):.0f} characters")

# Compare chunk statistics
naive_lengths = [len(doc.page_content) for doc in naive_split_documents]
semantic_lengths = [len(doc.page_content) for doc in semantic_split_documents]

print(f"\nChunk Statistics Comparison:")
print(f"Naive chunking: {len(naive_split_documents)} chunks, avg {np.mean(naive_lengths):.0f} chars, std {np.std(naive_lengths):.0f}")
print(f"Semantic chunking: {len(semantic_split_documents)} chunks, avg {np.mean(semantic_lengths):.0f} chars, std {np.std(semantic_lengths):.0f}")


Applying semantic chunking...
Semantic chunking created 1057 chunks
Average chunk length: 792 characters

Chunk Statistics Comparison:
Naive chunking: 1102 chunks, avg 864 chars, std 189
Semantic chunking: 1057 chunks, avg 792 chars, std 236


## 6. Advanced RAG Implementation (Semantic Chunking + Naive Retrieval)


In [14]:
# Set up vector store for semantic chunking
client_semantic = QdrantClient(":memory:")
client_semantic.create_collection(
    collection_name="loan_data_semantic",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store_semantic = QdrantVectorStore(
    client=client_semantic,
    collection_name="loan_data_semantic",
    embedding=embeddings,
)

# Add semantic chunks to vector store
_ = vector_store_semantic.add_documents(documents=semantic_split_documents)
retriever_semantic = vector_store_semantic.as_retriever(search_kwargs={"k": 5})

print("Semantic RAG vector store created successfully!")


Semantic RAG vector store created successfully!


In [15]:
# Define semantic retrieval node
def retrieve_semantic(state):
    retrieved_docs = retriever_semantic.invoke(state["question"])
    return {"context": retrieved_docs}

# Build semantic RAG graph
semantic_graph_builder = StateGraph(State).add_sequence([retrieve_semantic, generate])
semantic_graph_builder.add_edge(START, "retrieve_semantic")
semantic_graph = semantic_graph_builder.compile()

print("Semantic RAG graph created successfully!")


Semantic RAG graph created successfully!


## 7. Baseline Evaluation (Naive Chunking)


In [16]:
# Run baseline RAG on test dataset
import copy
import time

print("Running baseline evaluation...")
baseline_dataset = copy.deepcopy(dataset)

for test_row in baseline_dataset:
    response = baseline_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    time.sleep(1)  # Rate limiting

print("Baseline evaluation data collection complete!")


Running baseline evaluation...
Baseline evaluation data collection complete!


In [17]:
# Evaluate baseline with Ragas using exact specified metrics
from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.metrics import Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall, AnswerCorrectness

# Create evaluation dataset
baseline_evaluation_dataset = EvaluationDataset.from_pandas(baseline_dataset.to_pandas())

# Set up evaluator LLM (same as original)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Custom run config for longer timeout
custom_run_config = RunConfig(timeout=360)

print("Evaluating baseline RAG...")
baseline_result = evaluate(
    dataset=baseline_evaluation_dataset,
    metrics=[
        Faithfulness(),
        AnswerRelevancy(), 
        ContextPrecision(),
        ContextRecall(),
        AnswerCorrectness()
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

print("Baseline evaluation complete!")
baseline_result


Evaluating baseline RAG...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Baseline evaluation complete!


{'faithfulness': 0.7580, 'answer_relevancy': 0.9638, 'context_precision': 0.9375, 'context_recall': 0.6250, 'answer_correctness': 0.5618}

## 8. Advanced Evaluation (Semantic Chunking)


In [18]:
# Run semantic RAG on test dataset
print("Running semantic evaluation...")
semantic_dataset = copy.deepcopy(dataset)

for test_row in semantic_dataset:
    response = semantic_graph.invoke({"question": test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"]
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    time.sleep(1)  # Rate limiting

print("Semantic evaluation data collection complete!")


Running semantic evaluation...
Semantic evaluation data collection complete!


In [19]:
# Evaluate semantic RAG with same metrics
semantic_evaluation_dataset = EvaluationDataset.from_pandas(semantic_dataset.to_pandas())

print("Evaluating semantic RAG...")
semantic_result = evaluate(
    dataset=semantic_evaluation_dataset,
    metrics=[
        Faithfulness(),
        AnswerRelevancy(), 
        ContextPrecision(),
        ContextRecall(),
        AnswerCorrectness()
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

print("Semantic evaluation complete!")
semantic_result


Evaluating semantic RAG...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Semantic evaluation complete!


{'faithfulness': 0.8128, 'answer_relevancy': 0.9598, 'context_precision': 0.9167, 'context_recall': 0.6736, 'answer_correctness': 0.6238}
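
The two result dictionaries can be compared as relative changes, (semantic - baseline) / baseline × 100. The sketch below hard-codes the aggregate scores printed above purely for illustration (the variable names are assumptions). With only 12 test samples, differences of this size are indicative rather than conclusive.

```python
# Aggregate scores copied from the two evaluation outputs above
baseline_reported = {
    "faithfulness": 0.7580,
    "answer_relevancy": 0.9638,
    "context_precision": 0.9375,
    "context_recall": 0.6250,
    "answer_correctness": 0.5618,
}
semantic_reported = {
    "faithfulness": 0.8128,
    "answer_relevancy": 0.9598,
    "context_precision": 0.9167,
    "context_recall": 0.6736,
    "answer_correctness": 0.6238,
}

# Relative change of the semantic pipeline with respect to the baseline, in percent
relative_change = {
    name: (semantic_reported[name] - baseline_reported[name]) / baseline_reported[name] * 100
    for name in baseline_reported
}

for name, pct in relative_change.items():
    print(f"{name}: {pct:+.2f}%")
```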

## 9. Side-by-Side Metric Comparison


In [None]:
# Debug: Check the structure of results first
metric_keys = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness']
print("🔍 DEBUGGING RESULTS STRUCTURE")
print("Baseline result:", baseline_result)
print("Semantic result:", semantic_result)
print("\nBaseline values and types:")
for key in metric_keys:
    value = baseline_result[key]
    print(f"  {key}: {value} (type: {type(value)})")
print("\nSemantic values and types:")
for key in metric_keys:
    value = semantic_result[key]
    print(f"  {key}: {value} (type: {type(value)})")

# Function to safely collapse a metric result to a single scalar
def extract_scalar_value(value):
    """Return one float from a scalar or a list of per-sample scores (mean, ignoring NaN)."""
    if isinstance(value, (list, tuple, np.ndarray)):
        scores = [float(v) for v in value if isinstance(v, (int, float, np.number))]
        valid = [s for s in scores if not np.isnan(s)]
        # Average over samples; the first element alone would only reflect one test case
        return float(np.mean(valid)) if valid else 0.0
    elif isinstance(value, (int, float, np.number)):
        return float(value)
    else:
        return 0.0

# Create side-by-side comparison table with safe value extraction
baseline_values = [
    extract_scalar_value(baseline_result['faithfulness']),
    extract_scalar_value(baseline_result['answer_relevancy']),
    extract_scalar_value(baseline_result['context_precision']),
    extract_scalar_value(baseline_result['context_recall']),
    extract_scalar_value(baseline_result['answer_correctness'])
]

semantic_values = [
    extract_scalar_value(semantic_result['faithfulness']),
    extract_scalar_value(semantic_result['answer_relevancy']),
    extract_scalar_value(semantic_result['context_precision']),
    extract_scalar_value(semantic_result['context_recall']),
    extract_scalar_value(semantic_result['answer_correctness'])
]

print("\n✅ EXTRACTED VALUES:")
print("Baseline values:", baseline_values)
print("Semantic values:", semantic_values)

comparison_data = {
    'Metric': ['Faithfulness', 'Answer Relevancy', 'Context Precision', 'Context Recall', 'Answer Correctness'],
    'Baseline (Naive)': baseline_values,
    'Advanced (Semantic)': semantic_values
}

# Calculate improvements safely
improvements = []
for baseline, semantic in zip(comparison_data['Baseline (Naive)'], comparison_data['Advanced (Semantic)']):
    if baseline > 0:
        improvement = ((semantic - baseline) / baseline) * 100
    else:
        improvement = 0.0
    improvements.append(improvement)

comparison_data['Improvement (%)'] = improvements

# Create DataFrame
comparison_df = pd.DataFrame(comparison_data)
comparison_df['Improvement (%)'] = comparison_df['Improvement (%)'].round(2)
comparison_df['Baseline (Naive)'] = comparison_df['Baseline (Naive)'].round(4)
comparison_df['Advanced (Semantic)'] = comparison_df['Advanced (Semantic)'].round(4)

print("🔥 RAG EVALUATION COMPARISON 🔥")
print("=" * 60)
print(comparison_df.to_string(index=False))
print("=" * 60)

# Highlight best performing system for each metric
for idx, row in comparison_df.iterrows():
    metric = row['Metric']
    baseline_val = row['Baseline (Naive)']
    semantic_val = row['Advanced (Semantic)']
    improvement = row['Improvement (%)']
    
    winner = "🏆 SEMANTIC" if semantic_val > baseline_val else "🏆 BASELINE"
    print(f"{metric}: {winner} (+{improvement:.2f}%)" if improvement > 0 else f"{metric}: {winner} ({improvement:.2f}%)")


TypeError: '>' not supported between instances of 'list' and 'int'

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('RAG Evaluation: Naive vs Semantic Chunking', fontsize=16, fontweight='bold')

# 1. Bar chart comparison
metrics = comparison_df['Metric']
baseline_scores = comparison_df['Baseline (Naive)']
semantic_scores = comparison_df['Advanced (Semantic)']

x = np.arange(len(metrics))
width = 0.35

ax1 = axes[0, 0]
bars1 = ax1.bar(x - width/2, baseline_scores, width, label='Baseline (Naive)', alpha=0.8, color='skyblue')
bars2 = ax1.bar(x + width/2, semantic_scores, width, label='Advanced (Semantic)', alpha=0.8, color='lightcoral')

ax1.set_xlabel('Metrics')
ax1.set_ylabel('Scores')
ax1.set_title('Performance Comparison by Metric')
ax1.set_xticks(x)
ax1.set_xticklabels(metrics, rotation=45, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for bar in bars1:
    height = bar.get_height()
    ax1.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=8)
for bar in bars2:
    height = bar.get_height()
    ax1.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=8)

# 2. Improvement percentage chart
ax2 = axes[0, 1]
colors = ['green' if x > 0 else 'red' for x in comparison_df['Improvement (%)']]
bars = ax2.bar(metrics, comparison_df['Improvement (%)'], color=colors, alpha=0.7)
ax2.set_xlabel('Metrics')
ax2.set_ylabel('Improvement (%)')
ax2.set_title('Improvement: Semantic vs Baseline')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.5)

# Add value labels
for bar in bars:
    height = bar.get_height()
    ax2.annotate(f'{height:.1f}%', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3 if height > 0 else -15), textcoords="offset points", 
                ha='center', va='bottom' if height > 0 else 'top', fontsize=9)

# 3. Chunk size distribution comparison
ax3 = axes[1, 0]
ax3.hist(naive_lengths, bins=20, alpha=0.7, label='Naive Chunking', color='skyblue', density=True)
ax3.hist(semantic_lengths, bins=20, alpha=0.7, label='Semantic Chunking', color='lightcoral', density=True)
ax3.set_xlabel('Chunk Size (characters)')
ax3.set_ylabel('Density')
ax3.set_title('Chunk Size Distribution Comparison')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Radar chart for metric comparison
from math import pi

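# A radar chart needs a polar projection, so swap the Cartesian axes for a polar one
axes[1, 1].remove()
ax4 = fig.add_subplot(2, 2, 4, projection='polar')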
categories = metrics
N = len(categories)

# Compute angles for each metric
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]  # Complete the circle

# Add values
baseline_values = list(baseline_scores) + [baseline_scores[0]]
semantic_values = list(semantic_scores) + [semantic_scores[0]]

# Plot
ax4.plot(angles, baseline_values, 'o-', linewidth=2, label='Baseline (Naive)', color='skyblue')
ax4.fill(angles, baseline_values, alpha=0.25, color='skyblue')
ax4.plot(angles, semantic_values, 'o-', linewidth=2, label='Advanced (Semantic)', color='lightcoral')
ax4.fill(angles, semantic_values, alpha=0.25, color='lightcoral')

# Add labels
ax4.set_xticks(angles[:-1])
ax4.set_xticklabels(categories, fontsize=9)
ax4.set_title('Radar Chart: Performance Profile')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Interactive Plotly visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Metric Comparison', 'Improvement Analysis', 'Chunk Statistics', 'Score Distribution'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Metric comparison
fig.add_trace(
    go.Bar(name='Baseline (Naive)', x=metrics, y=baseline_scores, 
           marker_color='lightblue', text=[f'{v:.3f}' for v in baseline_scores], textposition='outside'),
    row=1, col=1
)
fig.add_trace(
    go.Bar(name='Advanced (Semantic)', x=metrics, y=semantic_scores,
           marker_color='lightcoral', text=[f'{v:.3f}' for v in semantic_scores], textposition='outside'),
    row=1, col=1
)

# 2. Improvement analysis
colors = ['green' if x > 0 else 'red' for x in comparison_df['Improvement (%)']]
fig.add_trace(
    go.Bar(x=metrics, y=comparison_df['Improvement (%)'], marker_color=colors,
           text=[f'{v:.1f}%' for v in comparison_df['Improvement (%)']], textposition='outside',
           name='Improvement', showlegend=False),
    row=1, col=2
)

# 3. Chunk statistics comparison
fig.add_trace(
    go.Box(y=naive_lengths, name='Naive Chunking', marker_color='lightblue'),
    row=2, col=1
)
fig.add_trace(
    go.Box(y=semantic_lengths, name='Semantic Chunking', marker_color='lightcoral'),
    row=2, col=1
)

# 4. Score distribution
all_baseline = list(baseline_scores)
all_semantic = list(semantic_scores)
fig.add_trace(
    go.Histogram(x=all_baseline, name='Baseline Distribution', opacity=0.7, marker_color='lightblue'),
    row=2, col=2
)
fig.add_trace(
    go.Histogram(x=all_semantic, name='Semantic Distribution', opacity=0.7, marker_color='lightcoral'),
    row=2, col=2
)

# Update layout
fig.update_layout(
    title_text="Interactive RAG Evaluation Dashboard",
    title_x=0.5,
    height=800,
    showlegend=True
)

# Update axes
fig.update_xaxes(title_text="Metrics", row=1, col=1)
fig.update_xaxes(title_text="Metrics", row=1, col=2)
fig.update_xaxes(title_text="Chunking Method", row=2, col=1)
fig.update_xaxes(title_text="Score Values", row=2, col=2)

fig.update_yaxes(title_text="Score", row=1, col=1)
fig.update_yaxes(title_text="Improvement (%)", row=1, col=2)
fig.update_yaxes(title_text="Chunk Size (chars)", row=2, col=1)
fig.update_yaxes(title_text="Frequency", row=2, col=2)

fig.show()


In [None]:
# Extract individual sample scores for statistical testing
baseline_df = baseline_dataset.to_pandas()
semantic_df = semantic_dataset.to_pandas()

# Collapse each metric to its mean across samples (np.nanmean also accepts a plain scalar)
print("🔬 STATISTICAL SIGNIFICANCE ANALYSIS 🔬")
print("=" * 60)

# Since we have limited samples, we'll focus on effect size and practical significance
metric_names = ['Faithfulness', 'Answer Relevancy', 'Context Precision', 'Context Recall', 'Answer Correctness']
baseline_values = [float(np.nanmean(baseline_result[m.lower().replace(' ', '_')])) for m in metric_names]
semantic_values = [float(np.nanmean(semantic_result[m.lower().replace(' ', '_')])) for m in metric_names]

# Calculate approximate effect sizes (Cohen's d style)
def cohens_d(x1, x2):
    """Approximate a Cohen's d-style effect size from two aggregate scores.

    A true Cohen's d needs per-sample variances; here the pooled standard
    deviation is assumed to be about 10% of each score, for demonstration only.
    """
    diff = x2 - x1
    # Assumed (not measured) pooled standard deviation
    pooled_std = np.sqrt(((x1 * 0.1) ** 2 + (x2 * 0.1) ** 2) / 2)
    return diff / pooled_std if pooled_std > 0 else 0

# Effect size analysis
effect_sizes = []
for i, metric in enumerate(['Faithfulness', 'Answer Relevancy', 'Context Precision', 'Context Recall', 'Answer Correctness']):
    baseline_val = baseline_values[i]
    semantic_val = semantic_values[i]
    effect_size = cohens_d(baseline_val, semantic_val)
    effect_sizes.append(effect_size)
    
    # Interpret effect size
    if abs(effect_size) < 0.2:
        interpretation = "Negligible"
    elif abs(effect_size) < 0.5:
        interpretation = "Small"
    elif abs(effect_size) < 0.8:
        interpretation = "Medium"
    else:
        interpretation = "Large"
    
    print(f"{metric}:")
    print(f"  Baseline: {baseline_val:.4f} | Semantic: {semantic_val:.4f}")
    print(f"  Effect Size (Cohen's d): {effect_size:.3f} ({interpretation})")
    print(f"  Practical Significance: {'✅ YES' if abs(effect_size) > 0.2 else '❌ NO'}")
    print()

# Overall assessment
print("📊 OVERALL STATISTICAL ASSESSMENT")
print("=" * 40)
positive_improvements = sum(1 for es in effect_sizes if es > 0.2)
total_metrics = len(effect_sizes)
print(f"Metrics with practical improvement: {positive_improvements}/{total_metrics}")
print(f"Average effect size: {np.mean(effect_sizes):.3f}")
print(f"Maximum effect size: {max(effect_sizes):.3f}")
print(f"Minimum effect size: {min(effect_sizes):.3f}")
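
If per-sample scores are available, Cohen's d can be computed from the measured sample variances instead of an assumed spread. A minimal sketch using only the standard library; the `cohens_d_samples` name and the score lists are made-up placeholders, not results from this notebook:

```python
import math
from statistics import mean, variance

def cohens_d_samples(x1, x2):
    """Cohen's d of x2 relative to x1, using the pooled sample standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    if pooled_var == 0:
        return 0.0  # no spread at all, so the effect size is undefined; report 0
    return (mean(x2) - mean(x1)) / math.sqrt(pooled_var)

# Placeholder per-sample scores for one metric (NOT real evaluation output)
placeholder_baseline = [0.5, 0.6, 0.7, 0.8]
placeholder_semantic = [0.6, 0.7, 0.8, 0.9]
d = cohens_d_samples(placeholder_baseline, placeholder_semantic)
print(f"Cohen's d: {d:.3f}")
```

Because both pipelines answer the same questions, a paired comparison of per-question differences would also be a natural choice here.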


In [None]:
# Additional statistical analysis: chunk size comparison
print("\n📏 CHUNK SIZE STATISTICAL ANALYSIS")
print("=" * 50)

# Perform t-test on chunk sizes
t_stat, p_value = stats.ttest_ind(naive_lengths, semantic_lengths)
print("T-test for chunk sizes:")
print(f"  T-statistic: {t_stat:.3f}")
print(f"  P-value: {p_value:.6f}")
print(f"  Significance: {'✅ Significant' if p_value < 0.05 else '❌ Not significant'} (α = 0.05)")

# Descriptive statistics
print("\nDescriptive Statistics:")
print("Naive Chunking:")
print(f"  Mean: {np.mean(naive_lengths):.1f} chars")
print(f"  Std:  {np.std(naive_lengths):.1f} chars")
print(f"  Min:  {np.min(naive_lengths)} chars")
print(f"  Max:  {np.max(naive_lengths)} chars")
print(f"  Q1:   {np.percentile(naive_lengths, 25):.1f} chars")
print(f"  Q3:   {np.percentile(naive_lengths, 75):.1f} chars")

print("\nSemantic Chunking:")
print(f"  Mean: {np.mean(semantic_lengths):.1f} chars")
print(f"  Std:  {np.std(semantic_lengths):.1f} chars")
print(f"  Min:  {np.min(semantic_lengths)} chars")
print(f"  Max:  {np.max(semantic_lengths)} chars")
print(f"  Q1:   {np.percentile(semantic_lengths, 25):.1f} chars")
print(f"  Q3:   {np.percentile(semantic_lengths, 75):.1f} chars")

# Calculate variance ratio
variance_ratio = np.var(semantic_lengths) / np.var(naive_lengths)
print("\nVariance Analysis:")
print(f"  Variance Ratio (Semantic/Naive): {variance_ratio:.3f}")
print(f"  Interpretation: {'More variable' if variance_ratio > 1 else 'Less variable'} chunk sizes in semantic approach")


In [None]:
# Qualitative analysis of responses
print("🔍 QUALITATIVE RESPONSE ANALYSIS 🔍")
print("=" * 60)

# Sample some questions and compare responses
sample_questions = baseline_df['user_input'].head(3).tolist()

for i, question in enumerate(sample_questions):
    print("\n" + "="*20 + f" QUESTION {i+1} " + "="*20)
    print(f"Q: {question}")
    print()
    
    baseline_response = baseline_df.iloc[i]['response']
    semantic_response = semantic_df.iloc[i]['response']
    
    print("🔸 BASELINE (Naive Chunking) RESPONSE:")
    response_preview = baseline_response[:300] + "..." if len(baseline_response) > 300 else baseline_response
    print(response_preview)
    print()
    
    print("🔹 SEMANTIC CHUNKING RESPONSE:")
    response_preview = semantic_response[:300] + "..." if len(semantic_response) > 300 else semantic_response
    print(response_preview)
    print()
    
    # Simple quality metrics
    baseline_len = len(baseline_response)
    semantic_len = len(semantic_response)
    
    print("📊 RESPONSE COMPARISON:")
    print(f"  Length: Baseline {baseline_len} chars | Semantic {semantic_len} chars")
    if baseline_len > 0:
        print(f"  Relative length: {semantic_len/baseline_len:.2f}x")
    else:
        print("  Relative length: N/A")
    
    # Count hedging words as a rough proxy for uncertainty (punctuation stripped before matching)
    uncertainty_words = {'however', 'but', 'although', 'unclear', 'unsure'}
    strip_chars = '.,;:!?()"\''
    baseline_uncertainty_count = sum(1 for w in baseline_response.lower().split() if w.strip(strip_chars) in uncertainty_words)
    semantic_uncertainty_count = sum(1 for w in semantic_response.lower().split() if w.strip(strip_chars) in uncertainty_words)
    
    print(f"  Uncertainty indicators: Baseline {baseline_uncertainty_count} | Semantic {semantic_uncertainty_count}")


In [None]:
# Context analysis - compare retrieved contexts
print("\n🎯 RETRIEVED CONTEXT ANALYSIS")
print("=" * 50)

for i, question in enumerate(sample_questions[:2]):  # Analyze first 2 questions
    print(f"\n--- QUESTION {i+1}: {question[:100]}... ---")
    
    baseline_contexts = baseline_df.iloc[i]['retrieved_contexts']
    semantic_contexts = semantic_df.iloc[i]['retrieved_contexts']
    
    print(f"\n🔸 BASELINE CONTEXTS ({len(baseline_contexts)} chunks):")
    for j, context in enumerate(baseline_contexts):
        print(f"  Chunk {j+1}: {len(context)} chars - {context[:150]}...")
    
    print(f"\n🔹 SEMANTIC CONTEXTS ({len(semantic_contexts)} chunks):")
    for j, context in enumerate(semantic_contexts):
        print(f"  Chunk {j+1}: {len(context)} chars - {context[:150]}...")
    
    # Calculate overlap between contexts
    baseline_text = " ".join(baseline_contexts).lower()
    semantic_text = " ".join(semantic_contexts).lower()
    
    # Simple word overlap calculation
    baseline_words = set(baseline_text.split())
    semantic_words = set(semantic_text.split())
    overlap = len(baseline_words.intersection(semantic_words))
    union = len(baseline_words.union(semantic_words))
    jaccard_similarity = overlap / union if union > 0 else 0
    
    print(f"\n📈 CONTEXT SIMILARITY ANALYSIS:")
    print(f"  Word overlap: {overlap} words")
    print(f"  Jaccard similarity: {jaccard_similarity:.3f}")
    diversity = 'High' if jaccard_similarity < 0.5 else 'Moderate' if jaccard_similarity < 0.8 else 'Low'
    print(f"  Context diversity: {diversity}")


In [None]:
# Executive Summary
print("🎯 EXECUTIVE SUMMARY: SEMANTIC CHUNKING vs NAIVE CHUNKING")
print("=" * 70)

# Calculate overall winner (a tied metric counts for neither system)
wins_semantic = sum(1 for semantic, baseline in zip(semantic_values, baseline_values) if semantic > baseline)
wins_baseline = sum(1 for semantic, baseline in zip(semantic_values, baseline_values) if baseline > semantic)

winner_text = 'SEMANTIC CHUNKING' if wins_semantic > wins_baseline else 'BASELINE (NAIVE)' if wins_baseline > wins_semantic else 'TIE'
print(f"\n🏆 OVERALL WINNER: {winner_text}")
print(f"   Semantic wins: {wins_semantic}/{len(baseline_values)} metrics")
print(f"   Baseline wins: {wins_baseline}/{len(baseline_values)} metrics")

# Key findings
print("\n📊 KEY FINDINGS:")
avg_improvement = np.mean(comparison_df['Improvement (%)'])
best_improvement = max(comparison_df['Improvement (%)'])
worst_improvement = min(comparison_df['Improvement (%)'])
best_metric = comparison_df.loc[comparison_df['Improvement (%)'].idxmax(), 'Metric']
worst_metric = comparison_df.loc[comparison_df['Improvement (%)'].idxmin(), 'Metric']

print(f"   • Average improvement: {avg_improvement:.1f}%")
print(f"   • Best improvement: {best_improvement:.1f}% in {best_metric}")
print(f"   • Weakest change: {worst_improvement:.1f}% in {worst_metric}")

# Practical implications
print("\n💡 PRACTICAL IMPLICATIONS:")
if avg_improvement > 5:
    print("   ✅ Semantic chunking shows meaningful improvements")
    print("   ✅ Recommended for production deployment")
    print("   ✅ Benefits likely outweigh computational overhead")
elif avg_improvement > 0:
    print("   ⚠️ Semantic chunking shows modest improvements")
    print("   ⚠️ Consider computational cost vs. benefit trade-off")
    print("   ⚠️ May be suitable for high-accuracy requirements")
else:
    print("   ❌ Semantic chunking does not show clear advantages")
    print("   ❌ Baseline approach may be more cost-effective")
    print("   ❌ Further optimization of semantic approach recommended")

# Technical recommendations
print("\n🔧 TECHNICAL RECOMMENDATIONS:")
variance_text = 'Higher' if variance_ratio > 1 else 'Lower'
print(f"   • Chunk size variance: {variance_text} in semantic approach")
print(f"   • Similarity threshold: {SIMILARITY_THRESHOLD} (consider tuning)")
print(f"   • Max chunk size: {MAX_CHUNK_SIZE} chars (consider optimization)")
print("   • Embedding model: sentence-transformers/all-MiniLM-L6-v2")

print("\n🎯 NEXT STEPS:")
print("   1. Hyperparameter tuning for similarity threshold")
print("   2. Experiment with different sentence embedding models")
print("   3. A/B testing with larger datasets")
print("   4. Cost-benefit analysis including computational overhead")
print("   5. User satisfaction evaluation")

print("\n" + "=" * 70)
print("📈 EVALUATION COMPLETE - DATA DRIVEN INSIGHTS DELIVERED! 🚀")
print("=" * 70)


# 📝 Brief Explanations for Advanced_Build Notebook Cells

Concise explanations for each code cell, intended to be added as markdown documentation:

## **Cell 2: Library Imports and Setup**
Import essential libraries for data processing, machine learning, visualization, and RAG implementation. Configure plotting styles and suppress warnings for cleaner output.

## **Cell 3: API Key Configuration**
Securely capture and store OpenAI API credentials needed for LLM and embedding model access throughout the notebook.

## **Cell 4: Document Loading**
Load PDF documents from the data directory using PyMuPDF loader to extract text content for RAG system processing.

## **Cell 5: Model Setup for Test Generation**
Initialize LLM (GPT-4o) and embedding models with Ragas wrappers for automated test dataset generation.

## **Cell 6: Synthetic Test Dataset Creation**
Use Ragas TestsetGenerator to automatically create evaluation questions, reference answers, and contexts from the loaded documents.

## **Cell 7: Naive Chunking Implementation**
Split documents using RecursiveCharacterTextSplitter with chunks of up to 1000 characters and a 200-character overlap. This is the baseline chunking strategy.
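
To make the idea of size-limited chunks with overlap concrete, here is a minimal pure-Python sketch. It is not the RecursiveCharacterTextSplitter implementation, which also tries to split on separators such as paragraph and line breaks; the function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows where consecutive windows share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])
```

Each window starts `chunk_size - overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next.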

## **Cell 8: Baseline Vector Store Setup**
Create in-memory Qdrant vector database, embed naive chunks using OpenAI embeddings, and configure retriever for similarity search.

## **Cell 9: Baseline RAG Graph Construction**
Build LangGraph workflow connecting retrieval and generation nodes to create the baseline RAG system pipeline.

## **Cell 10: Semantic Chunking Configuration**
Define parameters for semantic chunking approach including similarity threshold (0.7), max chunk size (1000), and load sentence transformer model.

## **Cell 11: Semantic Chunking Algorithm**
Implement core semantic chunking logic that splits text into sentences, calculates semantic similarity, and groups similar sentences into coherent chunks.
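
The greedy grouping can be sketched as follows. This is an illustrative sketch under stated assumptions, not the notebook's own code: it assumes each sentence is compared with the previous sentence via cosine similarity, and the names `cosine_similarity` and `greedy_semantic_chunks` as well as the toy 2-D vectors are hypothetical. Real sentence embeddings would replace the toy vectors.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def greedy_semantic_chunks(sentences, embeddings, threshold=0.7, max_chars=1000):
    """Append each sentence to the current chunk while it is similar enough
    to the previous sentence and the chunk stays within max_chars."""
    chunks, current, current_len = [], [], 0
    for i, sentence in enumerate(sentences):
        if current:
            similar = cosine_similarity(embeddings[i - 1], embeddings[i]) >= threshold
            fits = current_len + 1 + len(sentence) <= max_chars  # +1 for the joining space
            if similar and fits:
                current.append(sentence)
                current_len += 1 + len(sentence)
                continue
            chunks.append(" ".join(current))  # close the current chunk
        # Start a new chunk; a single sentence is always kept, even if it exceeds max_chars
        current, current_len = [sentence], len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy vectors standing in for real sentence embeddings (assumption for illustration)
sentences = ["Cats purr.", "Kittens meow.", "Stocks fell today."]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(greedy_semantic_chunks(sentences, embeddings))
```

Comparing against the previous sentence is only one possible design; comparing against the chunk centroid or the first sentence of the chunk would give different boundaries.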

## **Cell 12: Apply Semantic Chunking**
Execute semantic chunking on all documents and convert results to Document format for compatibility with the RAG pipeline.

## **Cell 13: Semantic Vector Store Setup**
Create separate vector database for semantic chunks using identical embedding model to ensure fair comparison with baseline.

## **Cell 14: Semantic RAG Graph Construction**
Build identical LangGraph workflow for semantic system, using same generation logic but different chunk retrieval source.

## **Cell 15: Baseline Evaluation Data Collection**
Run test questions through baseline RAG system, collecting generated responses and retrieved contexts for evaluation.

## **Cell 16: Baseline Ragas Evaluation**
Apply Ragas evaluation metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall, Answer Correctness) to score baseline performance.

## **Cell 17: Semantic Evaluation Data Collection**
Run identical test questions through semantic RAG system, collecting responses for direct comparison with baseline.

## **Cell 18: Semantic Ragas Evaluation**
Apply same Ragas evaluation metrics to semantic system to enable fair performance comparison.

## **Cell 19: Performance Comparison Table**
Create side-by-side comparison showing baseline vs semantic scores for each metric, calculate percentage improvements.
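
A percentage improvement can be read as the relative change over the baseline score. The sketch below assumes that definition (the notebook's own formula may differ); `improvement_pct` is a hypothetical helper, and the placeholder scores are made up for illustration and are not results from this evaluation.

```python
def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent (NaN if baseline is 0)."""
    if baseline == 0:
        return float("nan")  # relative change is undefined for a zero baseline
    return (candidate - baseline) / baseline * 100.0


# Placeholder (baseline, candidate) pairs, NOT measured scores
placeholder_scores = {"metric_a": (0.50, 0.55), "metric_b": (0.80, 0.76)}
for name, (base, cand) in placeholder_scores.items():
    print(f"{name}: {improvement_pct(base, cand):+.1f}%")
```

Because the change is relative, a small absolute gain on a low baseline score produces a large percentage, so percentages are best read alongside the raw scores.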

## **Cell 20: Static Visualization Dashboard**
Generate comprehensive matplotlib charts including performance comparisons, improvement percentages, chunk distributions, and radar plots.

## **Cell 21: Interactive Plotly Dashboard**
Build interactive visualization dashboard with hover details and zoom capabilities for deeper exploration of results.

## **Cell 22: Statistical Significance Analysis**
Compute Cohen's d effect sizes to gauge the practical (rather than statistical) significance of performance differences between the two systems, treating |d| > 0.2 as practically meaningful.
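
For reference, Cohen's d with a pooled standard deviation can be sketched as below. This assumes per-question scores are available for both systems; the helper names `cohens_d` and `interpret` are hypothetical, and the cut-offs 0.2, 0.5, and 0.8 are the conventional ones rather than values confirmed from the notebook.

```python
import numpy as np


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_b minus group_a) using the pooled sample std."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    if pooled_var == 0:
        return float("nan")  # undefined when neither group shows any variability
    return float((b.mean() - a.mean()) / np.sqrt(pooled_var))


def interpret(d):
    """Map |d| to the conventional qualitative labels."""
    size = abs(d)
    if size < 0.2:
        return "Negligible"
    if size < 0.5:
        return "Small"
    if size < 0.8:
        return "Medium"
    return "Large"


d = cohens_d([1, 2, 3], [2, 3, 4])
print(f"d = {d:.3f} ({interpret(d)})")
```

A positive d here means the second group scored higher on average; the sign convention is a choice and should match how the comparison is reported.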

## **Cell 23: Chunk Size Statistical Analysis**
Compare statistical properties of chunk sizes using t-tests and variance analysis to understand structural differences.
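
When two samples may have different variances, as chunk lengths from two chunking strategies can, Welch's variant of the t-test is a common choice. The sketch below computes Welch's t statistic and its approximate degrees of freedom by hand using only the standard library; `welch_t` is a hypothetical helper, and the sample lists are toy values. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` also returns the p-value.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of each mean
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Toy values for illustration only
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

Note that `statistics.variance` uses the sample variance (dividing by n - 1), whereas `np.var` divides by n unless `ddof=1` is passed.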

## **Cell 24: Qualitative Response Analysis**
Print the first three responses from both systems side by side, comparing response length and counts of uncertainty words (e.g. "however", "unclear") as rough indicators.

## **Cell 25: Context Retrieval Analysis**
Analyze what contexts each system retrieves for identical questions, measuring overlap and diversity using Jaccard similarity.
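
Jaccard similarity is the size of the intersection of two word sets divided by the size of their union. A minimal sketch follows; `jaccard_similarity` is a hypothetical helper, and it tokenizes with a regular expression so that punctuation does not create spurious distinct words, which is an assumption that differs from plain whitespace splitting.

```python
import re


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 if both texts are empty)."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


print(jaccard_similarity("the cat sat", "The cat ran."))
```

A value near 1 means the two systems retrieved largely the same vocabulary, while a value near 0 means their retrieved contexts barely overlap.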

## **Cell 26: Executive Summary and Recommendations**
Synthesize all evaluation results into actionable insights, determine overall winner, and provide technical recommendations for implementation.

---

## 🎯 **Usage Notes:**
- Each explanation is designed to be inserted as a markdown cell before its corresponding code cell
- Explanations focus on the **purpose and functionality** rather than implementation details
- Suitable for both technical and non-technical audiences reviewing the notebook
- Can be easily customized or expanded based on your documentation needs