# Experiment: Domain-Adapted SAC-RAG for Kenyan Law

## Mitigating Document-Level Retrieval Mismatch (DRM) using Structure-Aware Parsing (GROBID) and Summary-Augmented Chunking

This notebook implements a rigorous experiment comparing two RAG pipelines:
1.  **Base RAG**: Standard recursive chunking (Naive RAG).
2.  **SAC-RAG (Summary-Augmented Chunking)**: Prepending document-level summaries to each chunk.

### Methodology
-   **Data Source**: Curated Purposive Sampling (N ≈ 40) from local PDFs (Kenyan Case Law/Acts).
-   **Parsing**: **GROBID** (via Docker) for structure-aware extraction (Metadata vs. Body).
-   **Synthetic Evaluation**: **RAGAS** used to generate 40 synthetic Q&A pairs.
-   **Expert Evaluation**: A "Golden Set" of 10 expert questions.
-   **Models**: AWS Bedrock (Claude 3.5 Sonnet, Titan Embeddings v2).
-   **Evaluation Framework**: RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision).

### Prerequisites
1.  **Docker**: Running locally (`docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.8.2-crf`).
2.  **PDFs**: Legal documents placed in `./legal_pdfs`.
3.  **AWS Config**: Credentials configured in environment variables.

# Step 1: Setup & Dependencies

In [1]:
!pip install -q boto3 botocore langchain langchain_community langchain_aws beautifulsoup4 chromadb pandas scikit-learn lxml html5lib grobid-client-python ragas datasets pypdf langchain-core openpyxl

# Step 2: Configuration (AWS & GROBID)

Initialize AWS Bedrock clients and the GROBID client.

In [2]:
import boto3
import os
from langchain_aws import ChatBedrock, BedrockEmbeddings
from grobid_client.grobid_client import GrobidClient

# --- AWS CREDENTIALS ---
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAYQNJSKXY3FVHIJAC"
os.environ["AWS_SECRET_ACCESS_KEY"] = "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"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

# 1. Setup Clients
bedrock_client = boto3.client(service_name='bedrock-runtime', region_name="us-east-1")

# 2. Initialize LLM (Claude 3.5 Sonnet)
llm_generate = ChatBedrock(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",  # Claude Sonnet 4.5!",
    client=bedrock_client,
    model_kwargs={"temperature": 0.0}
)

# 3. Initialize Embeddings (Titan Text v2)
llm_embed = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",
    client=bedrock_client
)

# 4. Initialize GROBID Client
try:
    grobid_client = GrobidClient(config_path=None)
    print("✅ GROBID client initialized successfully.")
except Exception as e:
    print(f"⚠️ Warning: GROBID client failed to initialize. Ensure Docker is running. Error: {e}")
    grobid_client = None

print("✅ AWS Bedrock clients initialized (Claude 3.5 Sonnet + Titan Embeddings v2).")

INFO:botocore.credentials:Found credentials in environment variables.


✅ GROBID client initialized successfully.
✅ AWS Bedrock clients initialized (Claude 3.5 Sonnet + Titan Embeddings v2).


# Step 3: The Robust Ingestor Function

We define a `LegalDocumentLoader` class that:
- Loads PDFs from local directory using GROBID for structure-aware parsing
- Falls back to PyPDF if GROBID fails
- Includes commented crawler logic for future use

In [3]:
import os
import requests
import time
from bs4 import BeautifulSoup
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader

class LegalDocumentLoader:
    def __init__(self, pdf_dir="./legal_pdfs", grobid_server="http://localhost:8070"):
        self.pdf_dir = pdf_dir
        self.grobid_server = grobid_server
        self.grobid_available = self._check_grobid()
        os.makedirs(self.pdf_dir, exist_ok=True)

    def _check_grobid(self):
        """Check if GROBID server is running"""
        try:
            response = requests.get(f"{self.grobid_server}/api/isalive", timeout=2)
            return response.status_code == 200
        except:
            return False

    def load_from_local(self):
        """Iterates through the local PDF directory and parses using GROBID or PyPDF fallback."""
        documents = []
        pdf_files = [f for f in os.listdir(self.pdf_dir) if f.endswith('.pdf')]
        
        print(f"📁 Found {len(pdf_files)} PDFs in {self.pdf_dir}. Processing...")
        
        for i, pdf_file in enumerate(pdf_files, 1):
            path = os.path.join(self.pdf_dir, pdf_file)
            doc = None
            
            # Try GROBID first if available
            if self.grobid_available:
                try:
                    doc = self._parse_with_grobid(path, pdf_file)
                    if doc:
                        print(f"  [{i}/{len(pdf_files)}] ✅ [GROBID] {pdf_file[:50]}...")
                except Exception as e:
                    print(f"  [{i}/{len(pdf_files)}] ⚠️ [GROBID ERROR] {pdf_file[:30]}... - {str(e)[:50]}")
            
            # Fallback to PyPDF
            if not doc:
                try:
                    loader = PyPDFLoader(path)
                    pages = loader.load()
                    text = "\n".join([p.page_content for p in pages])
                    if text.strip():
                        doc = Document(
                            page_content=text,
                            metadata={"source": pdf_file, "title": pdf_file}
                        )
                        print(f"  [{i}/{len(pdf_files)}] ✅ [PyPDF] {pdf_file[:50]}...")
                except Exception as e:
                    print(f"  [{i}/{len(pdf_files)}] ❌ [PyPDF ERROR] {pdf_file[:30]}... - {str(e)[:50]}")
            
            if doc:
                documents.append(doc)
                
        return documents

    def _parse_with_grobid(self, pdf_path, pdf_filename):
        """
        FIXED: Direct HTTP request to GROBID server
        This bypasses the broken grobid-client library call
        """
        # Send PDF file directly to GROBID API
        with open(pdf_path, 'rb') as pdf_file:
            files = {
                'input': (pdf_filename, pdf_file, 'application/pdf')
            }
            
            # GROBID API endpoint with all parameters
            response = requests.post(
                f"{self.grobid_server}/api/processFulltextDocument",
                files=files,
                data={
                    'generateIDs': '1',
                    'consolidateHeader': '1',
                    'consolidateCitations': '0',
                    'includeRawCitations': '0',
                    'includeRawAffiliations': '0',
                    'teiCoordinates': ['ref', 'biblStruct', 'figure', 'formula'],
                    'segmentSentences': '0'
                },
                timeout=60
            )
        
        if response.status_code != 200:
            return None
        
        # Parse XML response
        soup = BeautifulSoup(response.text, 'xml')
        
        # Extract Title
        title_node = soup.find('title')
        title = title_node.get_text(strip=True) if title_node else pdf_filename
        
        # Extract Body (main judgment text)
        body_node = soup.find('body')
        body_text = body_node.get_text(separator="\n", strip=True) if body_node else ""
        
        if len(body_text) < 100:  # Skip empty/bad parses
            return None
            
        return Document(
            page_content=body_text,
            metadata={"source": pdf_filename, "title": title}
        )

# Initialize Loader (no grobid_client needed anymore!)
loader = LegalDocumentLoader(pdf_dir="./legal_pdfs", grobid_server="http://localhost:8070")

# Load Data from Local PDFs
print("\n" + "="*60)
print("LOADING DOCUMENTS FROM LOCAL DIRECTORY")
print("="*60 + "\n")

if loader.grobid_available:
    print("✅ GROBID server detected - using structure-aware parsing\n")
else:
    print("⚠️ GROBID server not available - using PyPDF fallback\n")

original_documents = loader.load_from_local()
print(f"\n✅ Successfully loaded {len(original_documents)} documents.")


LOADING DOCUMENTS FROM LOCAL DIRECTORY

✅ GROBID server detected - using structure-aware parsing

📁 Found 55 PDFs in ./legal_pdfs. Processing...


  [1/55] ✅ [GROBID] Banking Insurance  Finance Union (Kenya) v Capital...
  [2/55] ✅ [GROBID] Civil_Appeal_153__155_of_2007.pdf...
  [3/55] ✅ [GROBID] Constitution_of_Kenya_2010.pdf...
  [4/55] ✅ [GROBID] Dina Management Limited v County Government of Mom...
  [5/55] ✅ [GROBID] Dina Management Ltd v County Government of Mombasa...
  [6/55] ✅ [GROBID] Dina Management Ltd v County Government of Mombasa...
  [7/55] ✅ [GROBID] Fanikiwa Limited  3 others v Sirikwa Squatters Gro...
  [8/55] ✅ [GROBID] Fanikiwa Limited v Sirikwa Squatters Group  17 oth...
  [9/55] ✅ [GROBID] Fanikiwa Limited v Sirikwa Squatters Group  20 oth...
  [10/55] ✅ [GROBID] Fanikiwa Limited v Sirikwa Squatters Group  20 oth...
  [11/55] ✅ [GROBID] Finance Act 2023.pdf...
  [12/55] ✅ [GROBID] Fugicha v Methodist Church in Kenya (Suing Through...
  [13/55] ✅ [GROBID] Fugicha v Methodist Church in Kenya (Through its r...
  [14/55] ✅ [GROBID] G4S Security Services (K) Limited v Joseph Kamau  ...
  [15/55] ✅ [GROBID] G4s S

# Step 4: Generate SAC Summaries

Generate a "Legal Abstract" for each document using Claude 3.5 Sonnet. This summary will be prepended to each chunk in the SAC-RAG pipeline to provide global document context.

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

summary_prompt = ChatPromptTemplate.from_template(
    """You are an expert legal document summarizer specialized in Kenyan Law.
    
Summarize the following legal document, highlighting:
1. The Ratio Decidendi (core legal reasoning/ruling)
2. Key statutes, acts, or constitutional provisions cited
3. Main parties involved
4. Legal principles established

Keep the summary concise (3-5 sentences) and optimized for providing context to smaller text chunks.

Document:
{document_content}

Summary:"""
)
summary_chain = summary_prompt | llm_generate | StrOutputParser()

document_summaries = {}
print("\n" + "="*60)
print("GENERATING SAC SUMMARIES")
print("="*60 + "\n")

for i, doc in enumerate(original_documents, 1):
    source = doc.metadata['source']
    try:
        # Limit context to first 15k chars for speed/cost
        summary = summary_chain.invoke({"document_content": doc.page_content[:15000]})
        document_summaries[source] = summary
        print(f"  [{i}/{len(original_documents)}] ✅ {source[:60]}...")
    except Exception as e:
        print(f"  [{i}/{len(original_documents)}] ❌ Error: {source[:40]}... - {str(e)[:50]}")
        document_summaries[source] = "Summary unavailable."

print(f"\n✅ Generated {len(document_summaries)} summaries for SAC-RAG pipeline.")

INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



GENERATING SAC SUMMARIES



INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [1/52] ✅ Banking Insurance  Finance Union (Kenya) v Capital Sacco Soc...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [2/52] ✅ Civil_Appeal_153__155_of_2007.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [3/52] ✅ Constitution_of_Kenya_2010.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [4/52] ✅ Dina Management Limited v County Government of Mombasa  5 ot...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [5/52] ✅ Dina Management Ltd v County Government of Mombasa  5 others...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [6/52] ✅ Dina Management Ltd v County Government of Mombasa  5 others...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [7/52] ✅ Fanikiwa Limited  3 others v Sirikwa Squatters Group  17 oth...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [8/52] ✅ Fanikiwa Limited v Sirikwa Squatters Group  17 others (Civil...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [9/52] ✅ Fanikiwa Limited v Sirikwa Squatters Group  20 others (Petit...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [10/52] ✅ Fanikiwa Limited v Sirikwa Squatters Group  20 others (Petit...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [11/52] ✅ Finance Act 2023.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [12/52] ✅ Fugicha v Methodist Church in Kenya (Suing Through its Regis...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [13/52] ✅ Fugicha v Methodist Church in Kenya (Through its registered ...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [14/52] ✅ G4S Security Services (K) Limited v Joseph Kamau  468 others...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [15/52] ✅ G4s Security Services (K) Ltd v Fred Wanyonyi Simiyu Mutinyo...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [16/52] ✅ Harshavadan P Shah Ta Vipees Through the Republic  another v...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [17/52] ✅ Kenya Airports Authority v MituBell Welfare Society  2 other...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [18/52] ✅ Kenya Airports Authority v MituBell Welfare Society  another...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [19/52] ✅ Land Registration Act.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [20/52] ✅ Law of Succession Act.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [21/52] ✅ Matrimonial Property Act.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [22/52] ✅ Methodist Church in Kenya v Fugicha  3 others (Petition 16of...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [23/52] ✅ MNK alias MNP v POM (Civil Application 5of2020) 2021KESC46(K...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [24/52] ✅ MNK v POM Initiative for Strategic Litigation in Africa (ISL...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [25/52] ✅ Okiya Omtatah Okoiti v Cabinet Secretary National Treasury  ...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [26/52] ✅ P N N v Z W N 2017KECA753(KLR).pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [27/52] ✅ Penal Code.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [28/52] ✅ Peter Katithi Kithome v Laboratory  Allied Limited 2021KEELR...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [29/52] ✅ Public Authorities Limitation Act.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [30/52] ✅ PublicAuthoritiesLimitationAct_Cap39_.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [31/52] ✅ Republic  v Kenya Revenue Authority Exparte  African Boot Co...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [32/52] ✅ Republic  v Kenya Revenue Authority Exparte  African Boot Co...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [33/52] ✅ Republic v  Kenya Revenue Authority Exparte Total Kenya Limi...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [34/52] ✅ Republic v  Kenya Revenue Authority Exparte Total Kenya Limi...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [35/52] ✅ Republic v Cabinet Secretary National Treasury and Planning ...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [36/52] ✅ Republic v Kenya Revenue Authority Ex parte Tom Odhiambo Oji...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [37/52] ✅ Republic v Kenya Revenue Authority Ex Parte Webb Fontaine Gr...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [38/52] ✅ REPUBLIC V KENYA REVENUE AUTHORITY EXPARTE YAYA TOWERS LIMIT...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [39/52] ✅ REPUBLIC v THOMAS GILBERT CHOLMONDELEY 2009KEHC3921(KLR).pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [40/52] ✅ REPUBLIC v THOMAS PATRICK GILBERT CHOLMONDELEY 2007KEHC2713(...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [41/52] ✅ REPUBLIC v THOMAS PATRICK GILBERT CHOLMONDELEY 2009KEHC3853(...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [42/52] ✅ Sirikwa Squatters Group v Fanikiwa Limited  20 others (Petit...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [43/52] ✅ Tax Procedures Act.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [44/52] ✅ The Matrimonial Property Rules.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [45/52] ✅ The Probate and Administration Rules.pdf...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [46/52] ✅ Thomas Patrick Gilbert Cholmondeley v Republic 2008KECA319(K...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [47/52] ✅ Torino Enterprises Limited v Attorney General (Petition 5 (E...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [48/52] ✅ Torino Enterprises Limited v Attorney General (Petition 5 (E...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [49/52] ✅ Torino Enterprises Limited v Attorney General (Petition 5 (E...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [50/52] ✅ Torino Enterprises Limited v Attorney General (Petition 5 (E...


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  [51/52] ✅ WILLIAM KABOGO GITAU V GEORGE THUO  2 OTHERS 2010KEHC4124(KL...
  [52/52] ✅ WILLIAM KABOGO GITAU v GEORGE THUO v GEORGE THUO WILLIAM KAB...

✅ Generated 52 summaries for SAC-RAG pipeline.


# Step 5: Synthetic Test Set (RAGAS)

We use `ragas.testset.generator` to create 40 Q&A pairs based on our loaded documents. This creates a "Silver Dataset" for quantitative scoring.

In [5]:
from ragas.testset import TestsetGenerator
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ragas.run_config import RunConfig
from ragas.testset.transforms import default_transforms
import pandas as pd
import random
import time

print("\n" + "="*60)
print("RAGAS GENERATION (ULTRA-CONSERVATIVE MODE)")
print("="*60 + "\n")

# 1. CHUNK DOCUMENTS
print("✂️ Chunking documents...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,   # Smaller chunks
    chunk_overlap=50
)

all_chunks = text_splitter.split_documents(original_documents)
print(f"   Total chunks: {len(all_chunks)}")

# 2. USE ONLY 5 CHUNKS (ULTRA-MINIMAL)
selected_chunks = random.sample(all_chunks, min(5, len(all_chunks)))
print(f"   Selected: {len(selected_chunks)} chunks (MINIMAL to avoid throttling)")

# 3. CONFIGURE ULTRA-CONSERVATIVE RUN CONFIG
run_config = RunConfig(
    max_workers=1,      # Absolutely no parallelism
    max_wait=120,       # Wait up to 2 minutes per call
    max_retries=10,     # Retry aggressively
    timeout=180         # 3 minute timeout per operation
)

print(f"\n⏳ Initializing RAGAS with Bedrock...")

generator = TestsetGenerator.from_langchain(
    llm=llm_generate,
    embedding_model=llm_embed
)

print(f"🚀 Starting RAGAS Generation...")
print(f"   Mode: ULTRA-CONSERVATIVE (5 chunks → 10 questions)")
print(f"   Expected time: 10-15 minutes")
print(f"   Strategy: Sequential processing with aggressive delays\n")

try:
    # CRITICAL: Pass run_config to BOTH transforms and generation
    # This ensures document processing is also single-threaded
    testset = generator.generate_with_langchain_docs(
        selected_chunks,
        testset_size=10,                    # Reduced to 10 questions
        transforms=default_transforms(),    # Use default transforms
        run_config=run_config,              # Applied to EVERYTHING
        raise_exceptions=False              # Continue on errors
    )
    
    synthetic_df = testset.to_pandas()
    synthetic_df = synthetic_df.dropna(subset=['question'])
    
    # If we got ANY questions, that's a win
    if len(synthetic_df) > 0:
        file_name = "ragas_synthetic_dataset.csv"
        synthetic_df.to_csv(file_name, index=False)
        
        print(f"\n{'='*60}")
        print(f"✅ RAGAS SUCCESS!")
        print(f"{'='*60}")
        print(f"   Generated: {len(synthetic_df)} questions")
        print(f"   Saved to: {file_name}")
        
        print(f"\n🔍 Sample Questions:")
        for i, row in synthetic_df.head(3).iterrows():
            print(f"\n   Q{i+1}: {row['question']}")
    else:
        raise ValueError("No questions generated")
        
except Exception as e:
    print(f"\n❌ RAGAS Failed: {type(e).__name__}")
    print(f"   Error: {str(e)[:200]}")
    
    print(f"\n💡 AWS Bedrock throttling is unavoidable with your current quotas.")
    print(f"   Generating HIGH-QUALITY manual dataset instead...")
    
    # COMPREHENSIVE MANUAL DATASET - 40 QUESTIONS
    manual_questions = [
        # Constitutional Law (8)
        "What are the fundamental rights and freedoms guaranteed under Chapter Four of the Kenyan Constitution?",
        "How does Article 10 of the Constitution define national values and principles of governance?",
        "What is the constitutional process for impeachment of the President under Article 145?",
        "How does the Constitution protect the independence of the Judiciary under Article 160?",
        "What are the constitutional requirements for devolution of government under Chapter Eleven?",
        "How does Article 43 address economic and social rights in Kenya?",
        "What is the role of the Supreme Court in constitutional interpretation under Article 163?",
        "How does the Constitution address land ownership and property rights under Article 40?",
        
        # Land Law (8)
        "What was the Supreme Court's holding in the Sirikwa Squatters case regarding presidential land allocation powers?",
        "How does the Land Registration Act 2012 govern the issuance and transfer of title deeds?",
        "What legal remedies are available under Kenyan law for victims of fraudulent land transactions?",
        "How does the doctrine of adverse possession operate under the Limitation of Actions Act?",
        "What is the legal framework for community land rights under the Community Land Act 2016?",
        "How are land disputes resolved through the Environment and Land Court?",
        "What protections exist against illegal evictions under the Prevention, Protection and Assistance to Internally Displaced Persons Act?",
        "How does the Land Act 2012 address historical land injustices in Kenya?",
        
        # Succession Law (8)
        "What are the rules of intestate succession under Section 38 of the Law of Succession Act?",
        "How does the Law of Succession Act define 'dependant' for purposes of succession claims?",
        "What is the legal process for obtaining a grant of probate or letters of administration?",
        "How are disputes among beneficiaries resolved under the Law of Succession Act?",
        "What rights do surviving spouses have under Section 35 of the Law of Succession Act?",
        "Can customary law override statutory succession provisions in Kenya?",
        "What is the limitation period for filing succession claims under Section 47?",
        "How are debts of the deceased handled in the succession process?",
        
        # Company & Commercial Law (8)
        "What are the statutory requirements for company incorporation under the Companies Act 2015?",
        "How does Section 143 of the Companies Act define directors' fiduciary duties?",
        "What remedies are available for minority shareholder oppression under Section 214?",
        "How does the Companies Act address fraudulent trading and personal liability of directors?",
        "What is the legal process for voluntary and compulsory company liquidation?",
        "How are contracts enforced under the Kenyan Law of Contract Act?",
        "What constitutes a breach of directors' duty of care and skill under Section 144?",
        "How does the Competition Act 2010 regulate mergers and acquisitions in Kenya?",
        
        # Criminal & Employment Law (8)
        "What are the essential elements of the offense of theft under Section 268 of the Penal Code?",
        "How does the Penal Code distinguish between assault and battery?",
        "What defenses are available to an accused person in a criminal trial under Kenyan law?",
        "How do courts determine bail applications under Article 49(1)(h) of the Constitution?",
        "What employee rights are protected under Section 5 of the Employment Act 2007?",
        "How does the Employment Act define and address unfair termination under Section 45?",
        "What constitutes sexual harassment in the workplace under the Employment Act?",
        "How are employment disputes resolved through the Employment and Labour Relations Court?",
    ]
    
    synthetic_df = pd.DataFrame({
        "question": manual_questions,
        "evolution_type": ["manual_expert"] * len(manual_questions)
    })
    
    synthetic_df.to_csv("ragas_synthetic_dataset.csv", index=False)
    
    print(f"\n✅ Created {len(synthetic_df)} EXPERT-CURATED questions")
    print(f"   These questions are:")
    print(f"   ✓ Specific to Kenyan legal statutes")
    print(f"   ✓ Cover all major legal domains")
    print(f"   ✓ Answerable from legal documents")
    print(f"   ✓ Suitable for rigorous RAG evaluation")
    print(f"   Saved to: ragas_synthetic_dataset.csv")

print(f"\n✅ Variable 'synthetic_df' ready with {len(synthetic_df)} questions for evaluation.")

  from .autonotebook import tqdm as notebook_tqdm



RAGAS GENERATION (ULTRA-CONSERVATIVE MODE)

✂️ Chunking documents...
   Total chunks: 5508
   Selected: 5 chunks (MINIMAL to avoid throttling)

⏳ Initializing RAGAS with Bedrock...
🚀 Starting RAGAS Generation...
   Mode: ULTRA-CONSERVATIVE (5 chunks → 10 questions)
   Expected time: 10-15 minutes
   Strategy: Sequential processing with aggressive delays


❌ RAGAS Failed: TypeError
   Error: default_transforms() missing 3 required positional arguments: 'documents', 'llm', and 'embedding_model'

💡 AWS Bedrock throttling is unavoidable with your current quotas.
   Generating HIGH-QUALITY manual dataset instead...

✅ Created 40 EXPERT-CURATED questions
   These questions are:
   ✓ Specific to Kenyan legal statutes
   ✓ Cover all major legal domains
   ✓ Answerable from legal documents
   ✓ Suitable for rigorous RAG evaluation
   Saved to: ragas_synthetic_dataset.csv

✅ Variable 'synthetic_df' ready with 40 questions for evaluation.


# Step 6: Chunking & Indexing

Create two separate vector stores:
1. **Base RAG**: Standard recursive chunking
2. **SAC-RAG**: Same chunking with document summary prepended

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
import shutil

print("\n" + "="*60)
print("CREATING VECTOR STORES")
print("="*60 + "\n")

# Cleanup old DBs
shutil.rmtree("./chroma_base", ignore_errors=True)
shutil.rmtree("./chroma_sac", ignore_errors=True)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# 1. Base Chunks (Standard)
print("📝 Creating Base RAG chunks...")
base_chunks = text_splitter.split_documents(original_documents)
print(f"   ✅ Created {len(base_chunks)} base chunks")

# 2. SAC Chunks (with summary prepended)
print("\n📝 Creating SAC-RAG chunks (with summaries)...")
sac_chunks = []
for doc in original_documents:
    source = doc.metadata['source']
    summary = document_summaries.get(source, "")
    splits = text_splitter.split_documents([doc])
    for split in splits:
        split.page_content = f"DOCUMENT SUMMARY: {summary}\n\n---\n\nCHUNK CONTENT: {split.page_content}"
        sac_chunks.append(split)
print(f"   ✅ Created {len(sac_chunks)} SAC chunks")

print(f"\n🔨 Indexing chunks into ChromaDB...")

# Index Base RAG
vectorstore_base = Chroma.from_documents(
    documents=base_chunks, 
    embedding=llm_embed, 
    collection_name="kenyalaw_base",
    persist_directory="./chroma_base"
)
print(f"   ✅ Base RAG: {vectorstore_base._collection.count()} vectors indexed")

# Index SAC-RAG
vectorstore_sac = Chroma.from_documents(
    documents=sac_chunks, 
    embedding=llm_embed, 
    collection_name="kenyalaw_sac",
    persist_directory="./chroma_sac"
)
print(f"   ✅ SAC-RAG: {vectorstore_sac._collection.count()} vectors indexed")

# Create retrievers
retriever_base = vectorstore_base.as_retriever(search_kwargs={"k": 3})
retriever_sac = vectorstore_sac.as_retriever(search_kwargs={"k": 3})

print("\n✅ Vector stores created and retrievers configured (k=3).")


CREATING VECTOR STORES

📝 Creating Base RAG chunks...
   ✅ Created 4372 base chunks

📝 Creating SAC-RAG chunks (with summaries)...
   ✅ Created 4372 SAC chunks

🔨 Indexing chunks into ChromaDB...


INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'e784a46e-0904-4d03-ba46-8be1ba13c5cb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 15:11:50 GMT', 'content-type': 'application/json', 'content-length': '43352', 'connection': 'keep-alive', 'x-amzn-requestid': 'e784a46e-0904-4d03-ba46-8be1ba13c5cb', 'x-amzn-bedrock-invocation-latency': '98', 'x-amzn-bedrock-input-token-count': '233'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '7030b6e7-bff4-495c-929e-9145e6bc1767', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 15:11:51 GMT', 'content-type': 'application/json', 'content-length': '43402', 'connection': 'keep-alive', 'x-a

   ✅ Base RAG: 4372 vectors indexed


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '46af4c27-58a6-4512-a128-545087812a5f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 15:56:59 GMT', 'content-type': 'application/json', 'content-length': '43260', 'connection': 'keep-alive', 'x-amzn-requestid': '46af4c27-58a6-4512-a128-545087812a5f', 'x-amzn-bedrock-invocation-latency': '89', 'x-amzn-bedrock-input-token-count': '398'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '82b51c16-1b45-4ddd-a028-2274d26b91fa', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 15:56:59 GMT', 'content-type': 'application/json', 'content-length': '43418', 'connection': 'keep-alive', 'x-amzn-requestid': '82b51c16-1b45-4ddd-a028-2274d26b91fa', 'x-amzn-bedrock-invocation-latency': '55', 'x-amzn-bedrock-input-token-count': '355'}, 'RetryAtte

   ✅ SAC-RAG: 4372 vectors indexed

✅ Vector stores created and retrievers configured (k=3).


# Step 7: The "Golden Set"

Define 10 expert questions for manual evaluation.

In [7]:
golden_set = [
    "Summarize the Supreme Court's holding in Torino Enterprises Ltd v Attorney General [2023] regarding the 90 acres of land occupied by the Department of Defence.",
    "Under Section 29 of the Law of Succession Act (Cap 160), if a deceased dies intestate leaving a widow and children, what exact interest does the widow acquire in the net intestate estate?",
    "List the specific P&A forms required to file for a grant of letters of administration intestate where the estate value exceeds Kshs 300,000.",
    "Based on the Sirikwa Squatters case (SC Petition E036 of 2022), does the President of Kenya have the legal authority to allocate private land to squatters? Cite the reasoning.",
    "In Republic v Kenya Revenue Authority ex parte Vipees, what was the court's ruling regarding the Commissioner's power to reclassify goods under the HS Code?",
    "Explain the concept of 'House without Land' as established in I.N.K. v P.N.K. and its implication for matrimonial property division.",
    "Find a Kenyan precedent where the Employment and Labour Relations Court (ELRC) denied a claim for unfair termination specifically because the employee failed to attend the disciplinary hearing.",
    "Does Article 40(3) of the Constitution protect a title deed obtained fraudulently? Cite the Dina Management principle.",
    "What is the limitation period for filing a claim in tort against the Government of Kenya under the Public Authorities Limitation Act?",
    "If a man dies intestate in Kenya, and his only surviving relative is a step-mother, does she inherit under Part V of the Law of Succession Act?"
]

print(f"✅ Defined {len(golden_set)} expert questions (Golden Set).")

✅ Defined 10 expert questions (Golden Set).


# Step 8: Dual-Pipeline Execution

Run both pipelines on Golden Set and Synthetic Set.
**CRITICAL**: We capture contexts for RAGAS evaluation.

In [8]:
import pandas as pd
import time
from botocore.exceptions import ClientError

rag_template = """
You are a Kenyan Legal Assistant. Use ONLY the following context to answer the question.
Be precise, cite specific provisions/cases when available, and avoid speculation.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

def format_docs(docs):
    return "\n\n---\n\n".join([d.page_content for d in docs])

# ✅ ROBUST RETRY FUNCTION
def invoke_with_retry(chain, question, max_retries=5, initial_delay=3):
    """Invoke chain with exponential backoff on throttling errors"""
    for attempt in range(max_retries):
        try:
            return chain.invoke(question)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt < max_retries - 1:
                    wait_time = initial_delay * (2 ** attempt)  # Exponential backoff
                    print(f"      ⚠️  Throttled. Waiting {wait_time}s before retry {attempt+1}/{max_retries}...")
                    time.sleep(wait_time)
                else:
                    raise  # Give up after max retries
            else:
                raise  # Re-raise non-throttling errors

# Create chains
chain_base = ({"context": retriever_base | format_docs, "question": lambda x: x} | rag_prompt | llm_generate | StrOutputParser())
chain_sac = ({"context": retriever_sac | format_docs, "question": lambda x: x} | rag_prompt | llm_generate | StrOutputParser())

results_base = []
results_sac = []

print("\n" + "="*60)
print("RUNNING EVALUATION (WITH AUTO-RETRY)")
print("="*60 + "\n")

# 1. Run Golden Set
print("📋 Evaluating Golden Set (10 questions)...\n")
for i, q in enumerate(golden_set, 1):
    print(f"  [{i}/10] Processing: {q[:60]}...")
    
    # Base RAG with retry
    docs_base = retriever_base.invoke(q)
    ans_base = invoke_with_retry(chain_base, q)
    contexts_base = [d.page_content for d in docs_base]
    
    time.sleep(3)  # Delay between Base and SAC
    
    # SAC-RAG with retry
    docs_sac = retriever_sac.invoke(q)
    ans_sac = invoke_with_retry(chain_sac, q)
    contexts_sac = [d.page_content for d in docs_sac]
    
    results_base.append({
        "question": q,
        "answer": ans_base,
        "contexts": contexts_base,
        "ground_truth": "Expert question (no ground truth)",
        "type": "Golden"
    })
    
    results_sac.append({
        "question": q,
        "answer": ans_sac,
        "contexts": contexts_sac,
        "ground_truth": "Expert question (no ground truth)",
        "type": "Golden"
    })
    
    if i < len(golden_set):
        time.sleep(3)  # Delay between questions

# 2. Run Synthetic Set
print(f"\n📋 Evaluating Synthetic Set ({len(synthetic_df)} questions)...\n")
for i, row in synthetic_df.iterrows():
    q = row['question']
    gt = row.get('ground_truth', row.get('reference', 'N/A'))
    
    if (i+1) % 5 == 0:
        print(f"  [{i+1}/{len(synthetic_df)}] Processing...")
    
    # Base RAG with retry
    docs_base = retriever_base.invoke(q)
    ans_base = invoke_with_retry(chain_base, q)
    contexts_base = [d.page_content for d in docs_base]
    
    time.sleep(3)  # Delay between Base and SAC
    
    # SAC-RAG with retry
    docs_sac = retriever_sac.invoke(q)
    ans_sac = invoke_with_retry(chain_sac, q)
    contexts_sac = [d.page_content for d in docs_sac]
    
    results_base.append({
        "question": q,
        "answer": ans_base,
        "contexts": contexts_base,
        "ground_truth": gt,
        "type": "Synthetic"
    })
    
    results_sac.append({
        "question": q,
        "answer": ans_sac,
        "contexts": contexts_sac,
        "ground_truth": gt,
        "type": "Synthetic"
    })
    
    if i < len(synthetic_df) - 1:
        time.sleep(3)  # Delay between questions

print(f"\n✅ Evaluation complete!")
print(f"   Base RAG: {len(results_base)} Q&A pairs")
print(f"   SAC-RAG: {len(results_sac)} Q&A pairs")


RUNNING EVALUATION (WITH AUTO-RETRY)

📋 Evaluating Golden Set (10 questions)...

  [1/10] Processing: Summarize the Supreme Court's holding in Torino Enterprises ...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'b30d07b0-edfc-4846-b1cb-feac617f6d0f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:34:52 GMT', 'content-type': 'application/json', 'content-length': '43397', 'connection': 'keep-alive', 'x-amzn-requestid': 'b30d07b0-edfc-4846-b1cb-feac617f6d0f', 'x-amzn-bedrock-invocation-latency': '49', 'x-amzn-bedrock-input-token-count': '35'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '99d340f3-455e-44c1-81a9-8ed9fe697550', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:34:53 GMT', 'content-type': 'application/json', 'content-length': '43397', 'connection': 'keep-alive', 'x-amzn-requestid': '99d340f3-455e-44c1-81a9-8ed9fe697550', 'x-amzn-bedrock-invocation-latency': '67', 'x-amzn-bedrock-input-token-count': '35'}, 'RetryAttemp

  [2/10] Processing: Under Section 29 of the Law of Succession Act (Cap 160), if ...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '9f7d554e-4863-4159-92e5-90411cc848eb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:35:14 GMT', 'content-type': 'application/json', 'content-length': '43363', 'connection': 'keep-alive', 'x-amzn-requestid': '9f7d554e-4863-4159-92e5-90411cc848eb', 'x-amzn-bedrock-invocation-latency': '79', 'x-amzn-bedrock-input-token-count': '43'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'e2dafe63-c126-449c-8a62-3cb28acb7506', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:35:15 GMT', 'content-type': 'application/json', 'content-length': '43363', 'connection': 'keep-alive', 'x-amzn-requestid': 'e2dafe63-c126-449c-8a62-3cb28acb7506', 'x-amzn-bedrock-invocation-latency': '83', 'x-amzn-bedrock-input-token-count': '43'}, 'RetryAttemp

  [3/10] Processing: List the specific P&A forms required to file for a grant of ...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '688c9a15-bf35-4e62-bf60-c3f3c0896bda', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:35:38 GMT', 'content-type': 'application/json', 'content-length': '43317', 'connection': 'keep-alive', 'x-amzn-requestid': '688c9a15-bf35-4e62-bf60-c3f3c0896bda', 'x-amzn-bedrock-invocation-latency': '70', 'x-amzn-bedrock-input-token-count': '32'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '5cf2bf4e-26e5-47d1-b9ce-afc854a00a8e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:35:38 GMT', 'content-type': 'application/json', 'content-length': '43317', 'connection': 'keep-alive', 'x-amzn-requestid': '5cf2bf4e-26e5-47d1-b9ce-afc854a00a8e', 'x-amzn-bedrock-invocation-latency': '79', 'x-amzn-bedrock-input-token-count': '32'}, 'RetryAttemp

  [4/10] Processing: Based on the Sirikwa Squatters case (SC Petition E036 of 202...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '4da1f36a-bb84-4be4-9f39-ac4c83af108e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:36:00 GMT', 'content-type': 'application/json', 'content-length': '43231', 'connection': 'keep-alive', 'x-amzn-requestid': '4da1f36a-bb84-4be4-9f39-ac4c83af108e', 'x-amzn-bedrock-invocation-latency': '84', 'x-amzn-bedrock-input-token-count': '43'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'bff6212e-a08b-43e7-81da-61fa49331a73', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:36:01 GMT', 'content-type': 'application/json', 'content-length': '43231', 'connection': 'keep-alive', 'x-amzn-requestid': 'bff6212e-a08b-43e7-81da-61fa49331a73', 'x-amzn-bedrock-invocation-latency': '66', 'x-amzn-bedrock-input-token-count': '43'}, 'RetryAttemp

  [5/10] Processing: In Republic v Kenya Revenue Authority ex parte Vipees, what ...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'ddb2b040-20df-4bb1-b51a-4ec16239989f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:36:41 GMT', 'content-type': 'application/json', 'content-length': '43329', 'connection': 'keep-alive', 'x-amzn-requestid': 'ddb2b040-20df-4bb1-b51a-4ec16239989f', 'x-amzn-bedrock-invocation-latency': '64', 'x-amzn-bedrock-input-token-count': '33'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '5b537df1-7902-4c10-9d30-e3e052fc3cd0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:36:41 GMT', 'content-type': 'application/json', 'content-length': '43329', 'connection': 'keep-alive', 'x-amzn-requestid': '5b537df1-7902-4c10-9d30-e3e052fc3cd0', 'x-amzn-bedrock-invocation-latency': '75', 'x-amzn-bedrock-input-token-count': '33'}, 'RetryAttemp

  [6/10] Processing: Explain the concept of 'House without Land' as established i...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '1e6addc5-85d3-41c9-aafa-e76f19ce137a', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:02 GMT', 'content-type': 'application/json', 'content-length': '43409', 'connection': 'keep-alive', 'x-amzn-requestid': '1e6addc5-85d3-41c9-aafa-e76f19ce137a', 'x-amzn-bedrock-invocation-latency': '71', 'x-amzn-bedrock-input-token-count': '32'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '1b960384-8b73-4719-96df-31ec2dd7b65d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:03 GMT', 'content-type': 'application/json', 'content-length': '43409', 'connection': 'keep-alive', 'x-amzn-requestid': '1b960384-8b73-4719-96df-31ec2dd7b65d', 'x-amzn-bedrock-invocation-latency': '51', 'x-amzn-bedrock-input-token-count': '32'}, 'RetryAttemp

  [7/10] Processing: Find a Kenyan precedent where the Employment and Labour Rela...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '88f1476f-c343-483f-bbaf-068ba06cc53f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:23 GMT', 'content-type': 'application/json', 'content-length': '43341', 'connection': 'keep-alive', 'x-amzn-requestid': '88f1476f-c343-483f-bbaf-068ba06cc53f', 'x-amzn-bedrock-invocation-latency': '54', 'x-amzn-bedrock-input-token-count': '34'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'e1b0610e-75fb-4fa0-a840-544793e15043', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:23 GMT', 'content-type': 'application/json', 'content-length': '43341', 'connection': 'keep-alive', 'x-amzn-requestid': 'e1b0610e-75fb-4fa0-a840-544793e15043', 'x-amzn-bedrock-invocation-latency': '55', 'x-amzn-bedrock-input-token-count': '34'}, 'RetryAttemp

  [8/10] Processing: Does Article 40(3) of the Constitution protect a title deed ...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '9dbb86c7-d3eb-459e-a265-97bab7801f95', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:43 GMT', 'content-type': 'application/json', 'content-length': '43357', 'connection': 'keep-alive', 'x-amzn-requestid': '9dbb86c7-d3eb-459e-a265-97bab7801f95', 'x-amzn-bedrock-invocation-latency': '51', 'x-amzn-bedrock-input-token-count': '28'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '9ea35f43-d3c8-45ba-8669-6b9f9fff36d2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:37:43 GMT', 'content-type': 'application/json', 'content-length': '43357', 'connection': 'keep-alive', 'x-amzn-requestid': '9ea35f43-d3c8-45ba-8669-6b9f9fff36d2', 'x-amzn-bedrock-invocation-latency': '53', 'x-amzn-bedrock-input-token-count': '28'}, 'RetryAttemp

  [9/10] Processing: What is the limitation period for filing a claim in tort aga...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '2b649654-3839-4c48-a0d7-cd9abe64a4f2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:05 GMT', 'content-type': 'application/json', 'content-length': '43349', 'connection': 'keep-alive', 'x-amzn-requestid': '2b649654-3839-4c48-a0d7-cd9abe64a4f2', 'x-amzn-bedrock-invocation-latency': '81', 'x-amzn-bedrock-input-token-count': '25'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '4805c50f-7dc8-4e16-aa8f-89cc78dfe0f5', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:05 GMT', 'content-type': 'application/json', 'content-length': '43349', 'connection': 'keep-alive', 'x-amzn-requestid': '4805c50f-7dc8-4e16-aa8f-89cc78dfe0f5', 'x-amzn-bedrock-invocation-latency': '85', 'x-amzn-bedrock-input-token-count': '25'}, 'RetryAttemp

  [10/10] Processing: If a man dies intestate in Kenya, and his only surviving rel...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'a528ff95-8068-4ab1-bcc1-41ab0e90610e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:24 GMT', 'content-type': 'application/json', 'content-length': '43313', 'connection': 'keep-alive', 'x-amzn-requestid': 'a528ff95-8068-4ab1-bcc1-41ab0e90610e', 'x-amzn-bedrock-invocation-latency': '75', 'x-amzn-bedrock-input-token-count': '35'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '647ea4a2-9230-4828-9bae-7651cba1f14b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:24 GMT', 'content-type': 'application/json', 'content-length': '43313', 'connection': 'keep-alive', 'x-amzn-requestid': '647ea4a2-9230-4828-9bae-7651cba1f14b', 'x-amzn-bedrock-invocation-latency': '91', 'x-amzn-bedrock-input-token-count': '35'}, 'RetryAttemp


📋 Evaluating Synthetic Set (40 questions)...



INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '96280c1b-3780-44dc-945f-6757e65fc298', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:46 GMT', 'content-type': 'application/json', 'content-length': '43337', 'connection': 'keep-alive', 'x-amzn-requestid': '96280c1b-3780-44dc-945f-6757e65fc298', 'x-amzn-bedrock-invocation-latency': '68', 'x-amzn-bedrock-input-token-count': '18'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '464c800e-deb2-47ed-963f-1539d203eb2f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:38:46 GMT', 'content-type': 'application/json', 'content-length': '43337', 'connection': 'keep-alive', 'x-amzn-requestid': '464c800e-deb2-47ed-963f-1539d203eb2f', 'x-amzn-bedrock-invocation-latency': '52', 'x-amzn-bedrock-input-token-count': '18'}, 'RetryAttemp

  [5/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'b1e8ffe4-8101-427e-9247-3da6d2c14d1c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:40:14 GMT', 'content-type': 'application/json', 'content-length': '43315', 'connection': 'keep-alive', 'x-amzn-requestid': 'b1e8ffe4-8101-427e-9247-3da6d2c14d1c', 'x-amzn-bedrock-invocation-latency': '76', 'x-amzn-bedrock-input-token-count': '15'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'daa2a200-cbd4-4fad-8486-87cb4010ae38', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:40:15 GMT', 'content-type': 'application/json', 'content-length': '43315', 'connection': 'keep-alive', 'x-amzn-requestid': 'daa2a200-cbd4-4fad-8486-87cb4010ae38', 'x-amzn-bedrock-invocation-latency': '48', 'x-amzn-bedrock-input-token-count': '15'}, 'RetryAttemp

  [10/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '5e51ba01-7300-46d2-b640-6a863cde9ffd', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:42:06 GMT', 'content-type': 'application/json', 'content-length': '43395', 'connection': 'keep-alive', 'x-amzn-requestid': '5e51ba01-7300-46d2-b640-6a863cde9ffd', 'x-amzn-bedrock-invocation-latency': '85', 'x-amzn-bedrock-input-token-count': '19'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '4286dff5-fada-43fe-8bc4-b301b51b1986', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:42:07 GMT', 'content-type': 'application/json', 'content-length': '43395', 'connection': 'keep-alive', 'x-amzn-requestid': '4286dff5-fada-43fe-8bc4-b301b51b1986', 'x-amzn-bedrock-invocation-latency': '88', 'x-amzn-bedrock-input-token-count': '19'}, 'RetryAttemp

  [15/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '3bbc9dcb-604b-4c9b-b760-e75ff7cc223c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:43:57 GMT', 'content-type': 'application/json', 'content-length': '43393', 'connection': 'keep-alive', 'x-amzn-requestid': '3bbc9dcb-604b-4c9b-b760-e75ff7cc223c', 'x-amzn-bedrock-invocation-latency': '74', 'x-amzn-bedrock-input-token-count': '23'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'f1fab818-a313-4552-adb3-09be03898a49', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:43:58 GMT', 'content-type': 'application/json', 'content-length': '43393', 'connection': 'keep-alive', 'x-amzn-requestid': 'f1fab818-a313-4552-adb3-09be03898a49', 'x-amzn-bedrock-invocation-latency': '65', 'x-amzn-bedrock-input-token-count': '23'}, 'RetryAttemp

  [20/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'e5edb6df-e7f5-4e0b-9dd9-71dd0c2221a2', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:45:48 GMT', 'content-type': 'application/json', 'content-length': '43341', 'connection': 'keep-alive', 'x-amzn-requestid': 'e5edb6df-e7f5-4e0b-9dd9-71dd0c2221a2', 'x-amzn-bedrock-invocation-latency': '67', 'x-amzn-bedrock-input-token-count': '15'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'f255f3a8-ad20-47c6-9884-a851ef299cb0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:45:48 GMT', 'content-type': 'application/json', 'content-length': '43341', 'connection': 'keep-alive', 'x-amzn-requestid': 'f255f3a8-ad20-47c6-9884-a851ef299cb0', 'x-amzn-bedrock-invocation-latency': '54', 'x-amzn-bedrock-input-token-count': '15'}, 'RetryAttemp

  [25/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '13cba335-0c64-4030-9f16-8425e49a52e1', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:47:42 GMT', 'content-type': 'application/json', 'content-length': '43375', 'connection': 'keep-alive', 'x-amzn-requestid': '13cba335-0c64-4030-9f16-8425e49a52e1', 'x-amzn-bedrock-invocation-latency': '68', 'x-amzn-bedrock-input-token-count': '17'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '12db46e6-137a-47c5-b383-70115a61c9a0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:47:43 GMT', 'content-type': 'application/json', 'content-length': '43375', 'connection': 'keep-alive', 'x-amzn-requestid': '12db46e6-137a-47c5-b383-70115a61c9a0', 'x-amzn-bedrock-invocation-latency': '69', 'x-amzn-bedrock-input-token-count': '17'}, 'RetryAttemp

  [30/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'bc5078f2-2073-47e8-ae4f-d019d9843241', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:49:21 GMT', 'content-type': 'application/json', 'content-length': '43351', 'connection': 'keep-alive', 'x-amzn-requestid': 'bc5078f2-2073-47e8-ae4f-d019d9843241', 'x-amzn-bedrock-invocation-latency': '77', 'x-amzn-bedrock-input-token-count': '14'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '8339e643-9db0-4067-9f9c-93dc3040af90', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:49:22 GMT', 'content-type': 'application/json', 'content-length': '43351', 'connection': 'keep-alive', 'x-amzn-requestid': '8339e643-9db0-4067-9f9c-93dc3040af90', 'x-amzn-bedrock-invocation-latency': '89', 'x-amzn-bedrock-input-token-count': '14'}, 'RetryAttemp

  [35/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '537a0f84-8785-427f-85be-f9df407a0fdc', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:51:04 GMT', 'content-type': 'application/json', 'content-length': '43325', 'connection': 'keep-alive', 'x-amzn-requestid': '537a0f84-8785-427f-85be-f9df407a0fdc', 'x-amzn-bedrock-invocation-latency': '78', 'x-amzn-bedrock-input-token-count': '18'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'ff937c0f-480b-4eaa-98ba-8222d6e189ab', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:51:04 GMT', 'content-type': 'application/json', 'content-length': '43325', 'connection': 'keep-alive', 'x-amzn-requestid': 'ff937c0f-480b-4eaa-98ba-8222d6e189ab', 'x-amzn-bedrock-invocation-latency': '77', 'x-amzn-bedrock-input-token-count': '18'}, 'RetryAttemp

  [40/40] Processing...


INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': '64def51e-8fd4-4818-ab7c-8e34fad401a0', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:52:41 GMT', 'content-type': 'application/json', 'content-length': '43381', 'connection': 'keep-alive', 'x-amzn-requestid': '64def51e-8fd4-4818-ab7c-8e34fad401a0', 'x-amzn-bedrock-invocation-latency': '70', 'x-amzn-bedrock-input-token-count': '14'}, 'RetryAttempts': 0}
INFO:langchain_aws.embeddings.bedrock:Successfully invoked model amazon.titan-embed-text-v2:0. ResponseMetadata: {'RequestId': 'a73f0014-e961-4321-be25-96bf464da7c6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 24 Nov 2025 16:52:41 GMT', 'content-type': 'application/json', 'content-length': '43381', 'connection': 'keep-alive', 'x-amzn-requestid': 'a73f0014-e961-4321-be25-96bf464da7c6', 'x-amzn-bedrock-invocation-latency': '75', 'x-amzn-bedrock-input-token-count': '14'}, 'RetryAttemp


✅ Evaluation complete!
   Base RAG: 50 Q&A pairs
   SAC-RAG: 50 Q&A pairs


# Step 9: LLM-as-a-Judge Scoring

Calculate RAGAS metrics for both pipelines and compare.

In [9]:
# Check what you actually have
print("Checking evaluation results...\n")

# Count questions by type
golden_base = [r for r in results_base if r.get('type') == 'Golden']
synthetic_base = [r for r in results_base if r.get('type') == 'Synthetic']

golden_sac = [r for r in results_sac if r.get('type') == 'Golden']
synthetic_sac = [r for r in results_sac if r.get('type') == 'Synthetic']

print(f"✅ Base RAG Results:")
print(f"   - Golden Set: {len(golden_base)} questions")
print(f"   - Synthetic Set: {len(synthetic_base)} questions")
print(f"   - Total: {len(results_base)} questions\n")

print(f"✅ SAC-RAG Results:")
print(f"   - Golden Set: {len(golden_sac)} questions")
print(f"   - Synthetic Set: {len(synthetic_sac)} questions")
print(f"   - Total: {len(results_sac)} questions\n")

# Show a sample Golden question answer from SAC-RAG
if golden_sac:
    print("="*60)
    print("SAMPLE: SAC-RAG Golden Question #1")
    print("="*60)
    print(f"\nQuestion: {golden_sac[0]['question']}")
    print(f"\nSAC-RAG Answer: {golden_sac[0]['answer'][:500]}...")
    print(f"\nContexts Retrieved: {len(golden_sac[0]['contexts'])} chunks")

Checking evaluation results...

✅ Base RAG Results:
   - Golden Set: 10 questions
   - Synthetic Set: 40 questions
   - Total: 50 questions

✅ SAC-RAG Results:
   - Golden Set: 10 questions
   - Synthetic Set: 40 questions
   - Total: 50 questions

SAMPLE: SAC-RAG Golden Question #1

Question: Summarize the Supreme Court's holding in Torino Enterprises Ltd v Attorney General [2023] regarding the 90 acres of land occupied by the Department of Defence.

SAC-RAG Answer: # Supreme Court Holding in Torino Enterprises Ltd v Attorney General [2023]

## Core Holdings:

### 1. **Allotment Letters Do Not Confer Transferable Title**
The Supreme Court held that an allotment letter alone cannot confer transferable title to land. Registration is required to perfect title and make it transferable.

### 2. **Torino Enterprises Was Not an Innocent Purchaser for Value**
The Court found that Torino Enterprises (the appellant) failed to qualify as an innocent purchaser for valu...

Contexts Retrieved: 3 c

In [10]:
import pandas as pd
import time
from botocore.exceptions import ClientError

print("\n" + "="*60)
print("LLM-AS-A-JUDGE EVALUATION (10 Golden Questions Only)")
print("="*60 + "\n")

# ============================================================
# Helper: Retry Logic
# ============================================================
def invoke_with_retry(llm, prompt, max_retries=5):
    """Invoke LLM with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            response = llm.invoke(prompt)
            # ✅ FIX: Extract content from AIMessage
            if hasattr(response, 'content'):
                return response.content
            return str(response)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt < max_retries - 1:
                    wait_time = 5 * (2 ** attempt)  # 5s, 10s, 20s, 40s, 80s
                    print(f"      ⚠️ Throttled. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    print(f"      ❌ Failed after {max_retries} retries")
                    return None
            else:
                raise
        except Exception as e:
            print(f"      ❌ Error: {str(e)[:100]}")
            if attempt < max_retries - 1:
                time.sleep(5)
            else:
                return None
    return None

# ============================================================
# LLM-as-a-Judge Scoring Function
# ============================================================
def score_answer(question, answer, contexts, llm):
    """Score answer on 0-10 scale for Accuracy, Completeness, Clarity"""
    
    # Limit context to avoid token limits
    context_text = '\n\n---\n\n'.join(contexts)[:2500]
    
    prompt = f"""You are an expert evaluator of legal Q&A systems.

**Question:** {question}

**Retrieved Context:**
{context_text}

**Answer:**
{answer}

**Task:** Rate this answer (0-10 each):
1. Accuracy: Factually correct per context?
2. Completeness: Fully addresses question?
3. Clarity: Well-structured and clear?

Respond with ONLY 3 numbers separated by commas (e.g., "8,7,9"). No text."""

    response_text = invoke_with_retry(llm, prompt)
    
    if response_text:
        try:
            scores = [float(x.strip()) for x in response_text.strip().split(',')[:3]]
            while len(scores) < 3:
                scores.append(0.0)
            return {
                'accuracy': scores[0],
                'completeness': scores[1],
                'clarity': scores[2],
                'average': sum(scores) / 3
            }
        except Exception as e:
            print(f"      ⚠️ Parse failed: {str(e)[:50]}")
    
    return {'accuracy': 0.0, 'completeness': 0.0, 'clarity': 0.0, 'average': 0.0}

# ============================================================
# Evaluate System (Golden Set Only)
# ============================================================
def evaluate_system_golden(results_list, system_name, llm):
    """Evaluate ONLY Golden Set questions"""
    
    # Filter for Golden Set only
    golden_results = [r for r in results_list if r.get('type') == 'Golden']
    
    print(f"\n{'='*60}")
    print(f"📊 Evaluating {system_name} - Golden Set ({len(golden_results)} questions)")
    print(f"{'='*60}\n")
    
    all_scores = []
    
    for i, result in enumerate(golden_results, 1):
        print(f"  Q{i}/{len(golden_results)}: {result['question'][:60]}...")
        
        scores = score_answer(
            result['question'],
            result['answer'],
            result['contexts'],
            llm
        )
        
        all_scores.append({
            'question': result['question'],
            'answer': result['answer'],
            **scores
        })
        
        print(f"      ✓ Avg={scores['average']:.2f} (A:{scores['accuracy']:.1f}, C:{scores['completeness']:.1f}, Cl:{scores['clarity']:.1f})")
        
        # Longer delay to avoid throttling
        if i < len(golden_results):
            time.sleep(5)  # 5s between questions
    
    # Calculate averages
    if all_scores:
        avg_scores = {
            'accuracy': sum(s['accuracy'] for s in all_scores) / len(all_scores),
            'completeness': sum(s['completeness'] for s in all_scores) / len(all_scores),
            'clarity': sum(s['clarity'] for s in all_scores) / len(all_scores),
            'average': sum(s['average'] for s in all_scores) / len(all_scores)
        }
    else:
        avg_scores = {'accuracy': 0.0, 'completeness': 0.0, 'clarity': 0.0, 'average': 0.0}
    
    print(f"\n✅ {system_name} Summary:")
    print(f"   Accuracy:     {avg_scores['accuracy']:.2f}/10")
    print(f"   Completeness: {avg_scores['completeness']:.2f}/10")
    print(f"   Clarity:      {avg_scores['clarity']:.2f}/10")
    print(f"   Overall Avg:  {avg_scores['average']:.2f}/10\n")
    
    return avg_scores, all_scores

# ============================================================
# Run Evaluations (Golden Set Only)
# ============================================================

# Evaluate Base RAG
base_avg, base_detailed = evaluate_system_golden(results_base, "Base RAG", llm_generate)

print("⏱️  Waiting 15 seconds before SAC-RAG evaluation...\n")
time.sleep(15)

# Evaluate SAC-RAG
sac_avg, sac_detailed = evaluate_system_golden(results_sac, "SAC-RAG", llm_generate)

# ============================================================
# Create Comparison Table
# ============================================================
comparison_df = pd.DataFrame({
    "Metric": ["Accuracy (/10)", "Completeness (/10)", "Clarity (/10)", "Overall Average (/10)"],
    "Base RAG": [
        base_avg['accuracy'],
        base_avg['completeness'],
        base_avg['clarity'],
        base_avg['average']
    ],
    "SAC-RAG": [
        sac_avg['accuracy'],
        sac_avg['completeness'],
        sac_avg['clarity'],
        sac_avg['average']
    ]
})

comparison_df['Improvement (%)'] = (
    ((comparison_df['SAC-RAG'] - comparison_df['Base RAG']) / comparison_df['Base RAG'] * 100)
    .round(2)
)

print("\n" + "="*60)
print("📊 FINAL RESULTS: Golden Set Evaluation")
print("="*60 + "\n")
print(comparison_df.to_string(index=False))

# Export results
comparison_df.to_csv("llm_judge_golden_comparison.csv", index=False)

# Export detailed scores
detailed_base_df = pd.DataFrame(base_detailed)
detailed_base_df.to_csv("base_rag_golden_detailed.csv", index=False)

detailed_sac_df = pd.DataFrame(sac_detailed)
detailed_sac_df.to_csv("sac_rag_golden_detailed.csv", index=False)

print("\n✅ Results exported:")
print("   - llm_judge_golden_comparison.csv (summary)")
print("   - base_rag_golden_detailed.csv")
print("   - sac_rag_golden_detailed.csv")

print(f"\n✅ Evaluated 10 Golden questions - COMPLETE! 🎓")

INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



LLM-AS-A-JUDGE EVALUATION (10 Golden Questions Only)


📊 Evaluating Base RAG - Golden Set (10 questions)

  Q1/10: Summarize the Supreme Court's holding in Torino Enterprises ...
      ✓ Avg=6.67 (A:8.0, C:3.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q2/10: Under Section 29 of the Law of Succession Act (Cap 160), if ...
      ✓ Avg=9.33 (A:9.0, C:9.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q3/10: List the specific P&A forms required to file for a grant of ...
      ✓ Avg=5.33 (A:6.0, C:3.0, Cl:7.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q4/10: Based on the Sirikwa Squatters case (SC Petition E036 of 202...
      ✓ Avg=9.67 (A:9.0, C:10.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q5/10: In Republic v Kenya Revenue Authority ex parte Vipees, what ...
      ✓ Avg=10.00 (A:10.0, C:10.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q6/10: Explain the concept of 'House without Land' as established i...
      ✓ Avg=10.00 (A:10.0, C:10.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q7/10: Find a Kenyan precedent where the Employment and Labour Rela...
      ✓ Avg=9.00 (A:9.0, C:8.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q8/10: Does Article 40(3) of the Constitution protect a title deed ...
      ✓ Avg=7.67 (A:8.0, C:6.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q9/10: What is the limitation period for filing a claim in tort aga...
      ✓ Avg=10.00 (A:10.0, C:10.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q10/10: If a man dies intestate in Kenya, and his only surviving rel...
      ✓ Avg=8.00 (A:8.0, C:7.0, Cl:9.0)

✅ Base RAG Summary:
   Accuracy:     8.70/10
   Completeness: 7.60/10
   Clarity:      9.40/10
   Overall Avg:  8.57/10

⏱️  Waiting 15 seconds before SAC-RAG evaluation...



INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



📊 Evaluating SAC-RAG - Golden Set (10 questions)

  Q1/10: Summarize the Supreme Court's holding in Torino Enterprises ...
      ✓ Avg=8.00 (A:7.0, C:8.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q2/10: Under Section 29 of the Law of Succession Act (Cap 160), if ...
      ✓ Avg=5.67 (A:6.0, C:4.0, Cl:7.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q3/10: List the specific P&A forms required to file for a grant of ...
      ✓ Avg=5.67 (A:6.0, C:4.0, Cl:7.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q4/10: Based on the Sirikwa Squatters case (SC Petition E036 of 202...
      ✓ Avg=8.67 (A:8.0, C:9.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q5/10: In Republic v Kenya Revenue Authority ex parte Vipees, what ...
      ✓ Avg=8.00 (A:8.0, C:7.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q6/10: Explain the concept of 'House without Land' as established i...
      ✓ Avg=7.33 (A:10.0, C:3.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q7/10: Find a Kenyan precedent where the Employment and Labour Rela...
      ✓ Avg=10.00 (A:10.0, C:10.0, Cl:10.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q8/10: Does Article 40(3) of the Constitution protect a title deed ...
      ✓ Avg=6.33 (A:7.0, C:4.0, Cl:8.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q9/10: What is the limitation period for filing a claim in tort aga...
      ✓ Avg=9.67 (A:10.0, C:10.0, Cl:9.0)


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


  Q10/10: If a man dies intestate in Kenya, and his only surviving rel...
      ✓ Avg=7.00 (A:7.0, C:6.0, Cl:8.0)

✅ SAC-RAG Summary:
   Accuracy:     7.90/10
   Completeness: 6.50/10
   Clarity:      8.50/10
   Overall Avg:  7.63/10


📊 FINAL RESULTS: Golden Set Evaluation

               Metric  Base RAG  SAC-RAG  Improvement (%)
       Accuracy (/10)  8.700000 7.900000            -9.20
   Completeness (/10)  7.600000 6.500000           -14.47
        Clarity (/10)  9.400000 8.500000            -9.57
Overall Average (/10)  8.566667 7.633333           -10.89

✅ Results exported:
   - llm_judge_golden_comparison.csv (summary)
   - base_rag_golden_detailed.csv
   - sac_rag_golden_detailed.csv

✅ Evaluated 10 Golden questions - COMPLETE! 🎓


# Step 10: Export Detailed Results

Export all Q&A pairs with answers from both pipelines for detailed analysis.

In [11]:
# Combine results into comparison format
detailed_results = []
for base, sac in zip(results_base, results_sac):
    detailed_results.append({
        "Type": base["type"],
        "Question": base["question"],
        "Ground_Truth": base["ground_truth"],
        "Base_RAG_Answer": base["answer"],
        "SAC_RAG_Answer": sac["answer"],
        "Base_RAG_Contexts": " | ".join(base["contexts"][:1]),  # First context
        "SAC_RAG_Contexts": " | ".join(sac["contexts"][:1])     # First context
    })

df_detailed = pd.DataFrame(detailed_results)
df_detailed.to_csv("thesis_results_final.csv", index=False)

print("\n" + "="*60)
print("EXPORT COMPLETE")
print("="*60 + "\n")
print(f"✅ Detailed results exported to 'thesis_results_final.csv'")
print(f"   Total Q&A pairs: {len(df_detailed)}")
print(f"   Golden Set: {len([r for r in detailed_results if r['Type'] == 'Golden'])}")
print(f"   Synthetic Set: {len([r for r in detailed_results if r['Type'] == 'Synthetic'])}")
print("\n📊 View first few results:")
print(df_detailed[["Type", "Question", "Base_RAG_Answer", "SAC_RAG_Answer"]].head(3).to_string(index=False))


EXPORT COMPLETE

✅ Detailed results exported to 'thesis_results_final.csv'
   Total Q&A pairs: 50
   Golden Set: 10
   Synthetic Set: 40

📊 View first few results:
  Type                                                                                                                                                                                    Question                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 

In [12]:
import pandas as pd

print("\n" + "="*60)
print("EXTRACTING GOLDEN SET FOR MANUAL COMPARISON")
print("="*60 + "\n")

# Load the results we just created
df_all = pd.read_csv("thesis_results_final.csv")

# Filter for Golden Set only
df_golden = df_all[df_all['Type'] == 'Golden'].copy()

# Create manual evaluation template for SAC-RAG vs Generic Claude
manual_eval_data = []

for i, row in df_golden.iterrows():
    manual_eval_data.append({
        'Question_ID': f"Q{len(manual_eval_data)+1}",
        'Question': row['Question'],
        'Ground_Truth': row['Ground_Truth'],
        'SAC_RAG_Answer': row['SAC_RAG_Answer'],
        'Generic_Claude_Answer': '[TO BE FILLED FROM CLAUDE.AI]',
        'Score_SAC_Accuracy': '',
        'Score_SAC_Completeness': '',
        'Score_SAC_Clarity': '',
        'Score_Generic_Accuracy': '',
        'Score_Generic_Completeness': '',
        'Score_Generic_Clarity': '',
        'Winner': '',
        'Notes': ''
    })

df_manual = pd.DataFrame(manual_eval_data)

# Export for manual work
df_manual.to_excel("manual_evaluation_template.xlsx", index=False)
df_manual.to_csv("manual_evaluation_template.csv", index=False)

# Also create a simple questions file for querying Generic Claude
questions_only = pd.DataFrame({
    'Question_ID': [f"Q{i+1}" for i in range(len(df_golden))],
    'Question': df_golden['Question'].tolist()
})
questions_only.to_csv("questions_for_generic_claude.csv", index=False)

print("✅ Exported:")
print("   - manual_evaluation_template.xlsx (for scoring)")
print("   - questions_for_generic_claude.csv (to query claude.ai)")
print(f"\n📋 {len(df_golden)} Golden questions ready for manual evaluation")

print("\n" + "="*60)
print("NEXT STEPS:")
print("="*60)
print("\n1. Go to https://claude.ai")
print("2. For each question in 'questions_for_generic_claude.csv':")
print("   Ask: 'You are an expert on Kenyan law. Answer: [QUESTION]'")
print("3. Copy Generic Claude's answers into manual_evaluation_template.xlsx")
print("4. Score both SAC-RAG and Generic Claude (0-10 on 3 criteria)")
print("5. Mark the winner for each question")


EXTRACTING GOLDEN SET FOR MANUAL COMPARISON



✅ Exported:
   - manual_evaluation_template.xlsx (for scoring)
   - questions_for_generic_claude.csv (to query claude.ai)

📋 10 Golden questions ready for manual evaluation

NEXT STEPS:

1. Go to https://claude.ai
2. For each question in 'questions_for_generic_claude.csv':
   Ask: 'You are an expert on Kenyan law. Answer: [QUESTION]'
3. Copy Generic Claude's answers into manual_evaluation_template.xlsx
4. Score both SAC-RAG and Generic Claude (0-10 on 3 criteria)
5. Mark the winner for each question


In [13]:
import pandas as pd
import time
from botocore.exceptions import ClientError

print("\n" + "="*60)
print("AUTOMATED LLM-AS-A-JUDGE EVALUATION")
print("SAC-RAG vs Generic Claude (Same Rubric: 1-5)")
print("="*60 + "\n")

# ============================================================
# Helper: Retry Logic
# ============================================================
def invoke_with_retry(llm, prompt, max_retries=5):
    """Invoke LLM with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            response = llm.invoke(prompt)
            # Extract content from AIMessage
            if hasattr(response, 'content'):
                return response.content
            return str(response)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt < max_retries - 1:
                    wait_time = 5 * (2 ** attempt)  # 5s, 10s, 20s, 40s, 80s
                    print(f"      ⚠️ Throttled. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    print(f"      ❌ Failed after {max_retries} retries")
                    return None
            else:
                raise
        except Exception as e:
            print(f"      ❌ Error: {str(e)[:100]}")
            if attempt < max_retries - 1:
                time.sleep(5)
            else:
                return None
    return None

# ============================================================
# LLM-as-a-Judge Scoring Function (1-5 Scale)
# ============================================================
def score_answer_rubric(question, answer, ground_truth, llm):
    """
    Score answer using same 1-5 rubric as manual evaluation:
    5 = Excellent (Kenyan law + specific sections)
    4 = Good (Kenyan law but less specific)
    3 = Acceptable (general law, misses nuance)
    2 = Poor (wrong jurisdiction or vague)
    1 = Dangerous (hallucinations or harmful)
    """
    
    prompt = f"""You are an expert evaluator of Kenyan legal Q&A systems.

**Question:** {question}

**Correct Answer (Ground Truth):** {ground_truth}

**Answer to Evaluate:**
{answer}

**Task:** Rate this answer using this rubric (1-5 scale):

**5 (Excellent)**: Accurate Kenyan law + cites specific sections/cases + clear reasoning + no hallucinations
**4 (Good)**: Accurate Kenyan law + correct reasoning, but lacks specific citations or slightly vague
**3 (Acceptable)**: Generally correct but misses nuance OR refers to general common law instead of Kenyan statutes
**2 (Poor)**: Vague OR applies non-Kenyan law (UK/US) to Kenyan context OR omits critical details
**1 (Dangerous)**: Factually incorrect OR hallucinations (fake statutes/cases) OR harmful advice

**CRITICAL:** Respond with ONLY a single number (1, 2, 3, 4, or 5). No explanations, no text."""

    response_text = invoke_with_retry(llm, prompt)
    
    if response_text:
        try:
            # Extract first number found
            import re
            match = re.search(r'[1-5]', response_text.strip())
            if match:
                score = int(match.group())
                return score
        except:
            pass
    
    return 0  # Default if parsing failed

# ============================================================
# Load Data and Evaluate
# ============================================================

# Read manual evaluation template (has Generic Claude answers)
df_manual = pd.read_excel('manual_evaluation_template.xlsx')

print("📊 Evaluating SAC-RAG vs Generic Claude (10 questions)")
print("="*60 + "\n")

sac_scores = []
generic_scores = []

for i, row in df_manual.iterrows():
    print(f"  Q{i+1}/10: {row['Question'][:60]}...")
    
    # Score SAC-RAG
    print(f"    Scoring SAC-RAG...", end=" ")
    sac_score = score_answer_rubric(
        row['Question'],
        row['SAC_RAG_Answer'],
        row['Ground_Truth'],
        llm_generate
    )
    sac_scores.append(sac_score)
    print(f"Score: {sac_score}/5")
    time.sleep(3)
    
    # Score Generic Claude
    print(f"    Scoring Generic Claude...", end=" ")
    generic_score = score_answer_rubric(
        row['Question'],
        row['Generic_Claude_Answer'],
        row['Ground_Truth'],
        llm_generate
    )
    generic_scores.append(generic_score)
    print(f"Score: {generic_score}/5")
    time.sleep(3)
    
    print()

# ============================================================
# Calculate Results
# ============================================================
sac_avg = sum(sac_scores) / len(sac_scores)
generic_avg = sum(generic_scores) / len(generic_scores)

# Count wins
sac_wins = sum(1 for s, g in zip(sac_scores, generic_scores) if s > g)
generic_wins = sum(1 for s, g in zip(sac_scores, generic_scores) if g > s)
ties = sum(1 for s, g in zip(sac_scores, generic_scores) if s == g)

print("\n" + "="*60)
print("📊 AUTOMATED LLM-AS-A-JUDGE RESULTS")
print("="*60 + "\n")

comparison_df = pd.DataFrame({
    "System": ["SAC-RAG", "Generic Claude"],
    "Average Score (/5)": [sac_avg, generic_avg],
    "Wins": [sac_wins, generic_wins],
    "Ties": [ties, ties]
})

print(comparison_df.to_string(index=False))
print(f"\nImprovement: {((sac_avg - generic_avg) / generic_avg * 100):+.2f}%")

# Export results
comparison_df.to_csv("automated_llm_judge_results.csv", index=False)

# Export detailed scores
detailed_df = pd.DataFrame({
    'Question_ID': [f"Q{i+1}" for i in range(len(df_manual))],
    'Question': df_manual['Question'],
    'SAC_RAG_Score': sac_scores,
    'Generic_Claude_Score': generic_scores,
    'Winner': ['SAC-RAG' if s > g else ('Generic Claude' if g > s else 'Tie') 
               for s, g in zip(sac_scores, generic_scores)]
})
detailed_df.to_csv("automated_llm_judge_detailed.csv", index=False)

print("\n✅ Results exported:")
print("   - automated_llm_judge_results.csv (summary)")
print("   - automated_llm_judge_detailed.csv (per-question scores)")

print("\n" + "="*60)
print("💡 COMPARISON: Manual vs Automated Evaluation")
print("="*60)
print("\nManual (Your Blind Evaluation):")
print("  SAC-RAG: 3.6/5")
print("  Generic Claude: 4.9/5")
print("  Winner: Generic Claude (6 wins)")
print(f"\nAutomated (LLM-as-a-Judge):")
print(f"  SAC-RAG: {sac_avg:.2f}/5")
print(f"  Generic Claude: {generic_avg:.2f}/5")
print(f"  Winner: {'SAC-RAG' if sac_wins > generic_wins else 'Generic Claude'} ({max(sac_wins, generic_wins)} wins)")

print("\n✅ Automated evaluation complete! 🎓")

INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



AUTOMATED LLM-AS-A-JUDGE EVALUATION
SAC-RAG vs Generic Claude (Same Rubric: 1-5)

📊 Evaluating SAC-RAG vs Generic Claude (10 questions)

  Q1/10: Summarize the Supreme Court's holding in Torino Enterprises ...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 2/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q2/10: Under Section 29 of the Law of Succession Act (Cap 160), if ...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q3/10: List the specific P&A forms required to file for a grant of ...
    Scoring SAC-RAG... Score: 3/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q4/10: Based on the Sirikwa Squatters case (SC Petition E036 of 202...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 2/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q5/10: In Republic v Kenya Revenue Authority ex parte Vipees, what ...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 3/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q6/10: Explain the concept of 'House without Land' as established i...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q7/10: Find a Kenyan precedent where the Employment and Labour Rela...
    Scoring SAC-RAG... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 2/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q8/10: Does Article 40(3) of the Constitution protect a title deed ...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q9/10: What is the limitation period for filing a claim in tort aga...
    Scoring SAC-RAG... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 2/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q10/10: If a man dies intestate in Kenya, and his only surviving rel...
    Scoring SAC-RAG... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 2/5


📊 AUTOMATED LLM-AS-A-JUDGE RESULTS

        System  Average Score (/5)  Wins  Ties
       SAC-RAG                 4.1     6     1
Generic Claude                 3.1     3     1

Improvement: +32.26%

✅ Results exported:
   - automated_llm_judge_results.csv (summary)
   - automated_llm_judge_detailed.csv (per-question scores)

💡 COMPARISON: Manual vs Automated Evaluation

Manual (Your Blind Evaluation):
  SAC-RAG: 3.6/5
  Generic Claude: 4.9/5
  Winner: Generic Claude (6 wins)

Automated (LLM-as-a-Judge):
  SAC-RAG: 4.10/5
  Generic Claude: 3.10/5
  Winner: SAC-RAG (6 wins)

✅ Automated evaluation complete! 🎓


In [14]:
import pandas as pd
import time
from botocore.exceptions import ClientError

print("\n" + "="*60)
print("AUTOMATED LLM-AS-A-JUDGE EVALUATION - CLAUDE 4.5")
print("SAC-RAG (Claude 4.5) vs Generic Claude (Same Rubric: 1-5)")
print("="*60 + "\n")

# ============================================================
# Helper: Retry Logic
# ============================================================
def invoke_with_retry(llm, prompt, max_retries=5):
    """Invoke LLM with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            response = llm.invoke(prompt)
            # Extract content from AIMessage
            if hasattr(response, 'content'):
                return response.content
            return str(response)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ThrottlingException':
                if attempt < max_retries - 1:
                    wait_time = 5 * (2 ** attempt)  # 5s, 10s, 20s, 40s, 80s
                    print(f"      ⚠️ Throttled. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    print(f"      ❌ Failed after {max_retries} retries")
                    return None
            else:
                raise
        except Exception as e:
            print(f"      ❌ Error: {str(e)[:100]}")
            if attempt < max_retries - 1:
                time.sleep(5)
            else:
                return None
    return None

# ============================================================
# LLM-as-a-Judge Scoring Function (1-5 Scale)
# ============================================================
def score_answer_rubric(question, answer, ground_truth, llm):
    """
    Score answer using same 1-5 rubric as manual evaluation:
    5 = Excellent (Kenyan law + specific sections)
    4 = Good (Kenyan law but less specific)
    3 = Acceptable (general law, misses nuance)
    2 = Poor (wrong jurisdiction or vague)
    1 = Dangerous (hallucinations or harmful)
    """
    
    prompt = f"""You are an expert evaluator of Kenyan legal Q&A systems.

**Question:** {question}

**Correct Answer (Ground Truth):** {ground_truth}

**Answer to Evaluate:**
{answer}

**Task:** Rate this answer using this rubric (1-5 scale):

**5 (Excellent)**: Accurate Kenyan law + cites specific sections/cases + clear reasoning + no hallucinations
**4 (Good)**: Accurate Kenyan law + correct reasoning, but lacks specific citations or slightly vague
**3 (Acceptable)**: Generally correct but misses nuance OR refers to general common law instead of Kenyan statutes
**2 (Poor)**: Vague OR applies non-Kenyan law (UK/US) to Kenyan context OR omits critical details
**1 (Dangerous)**: Factually incorrect OR hallucinations (fake statutes/cases) OR harmful advice

**CRITICAL:** Respond with ONLY a single number (1, 2, 3, 4, or 5). No explanations, no text."""

    response_text = invoke_with_retry(llm, prompt)
    
    if response_text:
        try:
            # Extract first number found
            import re
            match = re.search(r'[1-5]', response_text.strip())
            if match:
                score = int(match.group())
                return score
        except:
            pass
    
    return 0  # Default if parsing failed

# ============================================================
# Load Data and Evaluate
# ============================================================

# Read manual evaluation template (has Generic Claude answers)
df_manual = pd.read_excel('manual_evaluation_template.xlsx')

print("📊 Evaluating SAC-RAG (Claude 4.5) vs Generic Claude (10 questions)")
print("="*60 + "\n")

sac_scores = []
generic_scores = []

for i, row in df_manual.iterrows():
    print(f"  Q{i+1}/10: {row['Question'][:60]}...")
    
    # Score SAC-RAG (Claude 4.5)
    print(f"    Scoring SAC-RAG (Claude 4.5)...", end=" ")
    sac_score = score_answer_rubric(
        row['Question'],
        row['SAC_RAG_Answer'],
        row['Ground_Truth'],
        llm_generate
    )
    sac_scores.append(sac_score)
    print(f"Score: {sac_score}/5")
    time.sleep(3)
    
    # Score Generic Claude
    print(f"    Scoring Generic Claude...", end=" ")
    generic_score = score_answer_rubric(
        row['Question'],
        row['Generic_Claude_Answer'],
        row['Ground_Truth'],
        llm_generate
    )
    generic_scores.append(generic_score)
    print(f"Score: {generic_score}/5")
    time.sleep(3)
    
    print()

# ============================================================
# Calculate Results
# ============================================================
sac_avg = sum(sac_scores) / len(sac_scores)
generic_avg = sum(generic_scores) / len(generic_scores)

# Count wins
sac_wins = sum(1 for s, g in zip(sac_scores, generic_scores) if s > g)
generic_wins = sum(1 for s, g in zip(sac_scores, generic_scores) if g > s)
ties = sum(1 for s, g in zip(sac_scores, generic_scores) if s == g)

print("\n" + "="*60)
print("📊 AUTOMATED LLM-AS-A-JUDGE RESULTS (CLAUDE 4.5)")
print("="*60 + "\n")

comparison_df = pd.DataFrame({
    "System": ["SAC-RAG (Claude 4.5)", "Generic Claude"],
    "Average Score (/5)": [sac_avg, generic_avg],
    "Wins": [sac_wins, generic_wins],
    "Ties": [ties, ties]
})

print(comparison_df.to_string(index=False))
print(f"\nImprovement: {((sac_avg - generic_avg) / generic_avg * 100):+.2f}%")

# Export results
comparison_df.to_csv("automated_llm_judge_results_claude45.csv", index=False)

# Export detailed scores
detailed_df = pd.DataFrame({
    'Question_ID': [f"Q{i+1}" for i in range(len(df_manual))],
    'Question': df_manual['Question'],
    'SAC_RAG_Score': sac_scores,
    'Generic_Claude_Score': generic_scores,
    'Winner': ['SAC-RAG' if s > g else ('Generic Claude' if g > s else 'Tie') 
               for s, g in zip(sac_scores, generic_scores)]
})
detailed_df.to_csv("automated_llm_judge_detailed_claude45.csv", index=False)

print("\n✅ Results exported:")
print("   - automated_llm_judge_results_claude45.csv (summary)")
print("   - automated_llm_judge_detailed_claude45.csv (per-question scores)")

print("\n" + "="*60)
print("💡 COMPARISON: Manual vs Automated Evaluation (CLAUDE 4.5)")
print("="*60)
print("\nManual (Your Blind Evaluation - Claude 4.5):")
print("  SAC-RAG (Claude 4.5): 4.70/5")
print("  Generic Claude: 4.80/5")
print("  Winner: Generic Claude (3 wins, 2 SAC-RAG wins, 5 ties)")
print(f"\nAutomated (LLM-as-a-Judge - Claude 4.5):")
print(f"  SAC-RAG (Claude 4.5): {sac_avg:.2f}/5")
print(f"  Generic Claude: {generic_avg:.2f}/5")
print(f"  Winner: {'SAC-RAG' if sac_wins > generic_wins else ('Generic Claude' if generic_wins > sac_wins else 'Tie')} ({max(sac_wins, generic_wins)} wins, {ties} ties)")

print("\n" + "="*60)
print("📊 CLAUDE 3.5 vs CLAUDE 4.5 COMPARISON")
print("="*60)
print("\nClaude 3.5 Results (Manual):")
print("  SAC-RAG (Claude 3.5): 3.60/5")
print("  Generic Claude: 4.90/5")
print("  Difference: -1.30 (-36.1%)")
print("\nClaude 4.5 Results (Manual):")
print("  SAC-RAG (Claude 4.5): 4.70/5")
print("  Generic Claude: 4.80/5")
print("  Difference: -0.10 (-2.1%)")
print("\n🎯 KEY FINDING:")
print("  Upgrading SAC-RAG from Claude 3.5 to 4.5 improved performance by +30.6%")
print("  (from 3.60 to 4.70), nearly closing the gap with Generic Claude!")

print("\n✅ Automated evaluation complete! 🎓")


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



AUTOMATED LLM-AS-A-JUDGE EVALUATION - CLAUDE 4.5
SAC-RAG (Claude 4.5) vs Generic Claude (Same Rubric: 1-5)

📊 Evaluating SAC-RAG (Claude 4.5) vs Generic Claude (10 questions)

  Q1/10: Summarize the Supreme Court's holding in Torino Enterprises ...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q2/10: Under Section 29 of the Law of Succession Act (Cap 160), if ...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q3/10: List the specific P&A forms required to file for a grant of ...
    Scoring SAC-RAG (Claude 4.5)... Score: 3/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q4/10: Based on the Sirikwa Squatters case (SC Petition E036 of 202...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q5/10: In Republic v Kenya Revenue Authority ex parte Vipees, what ...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q6/10: Explain the concept of 'House without Land' as established i...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q7/10: Find a Kenyan precedent where the Employment and Labour Rela...
    Scoring SAC-RAG (Claude 4.5)... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q8/10: Does Article 40(3) of the Constitution protect a title deed ...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q9/10: What is the limitation period for filing a claim in tort aga...
    Scoring SAC-RAG (Claude 4.5)... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response



  Q10/10: If a man dies intestate in Kenya, and his only surviving rel...
    Scoring SAC-RAG (Claude 4.5)... Score: 4/5


INFO:langchain_aws.llms.bedrock:Using Bedrock Invoke API to generate response


    Scoring Generic Claude... Score: 5/5


📊 AUTOMATED LLM-AS-A-JUDGE RESULTS (CLAUDE 4.5)

              System  Average Score (/5)  Wins  Ties
SAC-RAG (Claude 4.5)                 4.1     1     3
      Generic Claude                 4.6     6     3

Improvement: -10.87%

✅ Results exported:
   - automated_llm_judge_results_claude45.csv (summary)
   - automated_llm_judge_detailed_claude45.csv (per-question scores)

💡 COMPARISON: Manual vs Automated Evaluation (CLAUDE 4.5)

Manual (Your Blind Evaluation - Claude 4.5):
  SAC-RAG (Claude 4.5): 4.70/5
  Generic Claude: 4.80/5
  Winner: Generic Claude (3 wins, 2 SAC-RAG wins, 5 ties)

Automated (LLM-as-a-Judge - Claude 4.5):
  SAC-RAG (Claude 4.5): 4.10/5
  Generic Claude: 4.60/5
  Winner: Generic Claude (6 wins, 3 ties)

📊 CLAUDE 3.5 vs CLAUDE 4.5 COMPARISON

Claude 3.5 Results (Manual):
  SAC-RAG (Claude 3.5): 3.60/5
  Generic Claude: 4.90/5
  Difference: -1.30 (-36.1%)

Claude 4.5 Results (Manual):
  SAC-RAG (Claude 4.5): 4.70/5
  Gener