# Quilter Senior AI Scientist Challenge – Advisor Assistant

Kai Glahome
21/02/2026

Design decisions, evaluation strategy, and reflections on further improvements are covered in the accompanying documentation.


To run:

- Place the source Quilter PDF documents in a folder named `quilter pdfs - advisor support material` at the project root (this is the folder name `PDF_DIR` points to in the config cell).
- Add your OpenAI API key in a `.env` file at the project root:  
  `OPENAI_API_KEY=your_api_key_here`
- Run the notebook from the start. The notebook will:
  - Ingest and process the PDFs
  - Build the Chroma vector database automatically
  - Provide the interactive assistant interface

I ran this on Python 3.12.3.

In [1]:
# dependencies

!pip install -r requirements.txt



In [2]:
# import + config

import os
import re
import hashlib
import json
import pymupdf
import numpy as np
from pathlib import Path
from dotenv import load_dotenv
from rank_bm25 import BM25Okapi
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import chromadb

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PDF_DIR = Path("quilter pdfs - advisor support material")
CHROMA_DIR = "./chroma_db"
COLLECTION_NAME = "quilter_docs"

print("Config loaded.")
print(f"PDFs found: {len(list(PDF_DIR.glob('*.pdf')))}")

Config loaded.
PDFs found: 89


## PDF Extraction

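The cell below extracts text with pdfplumber. An earlier PyMuPDF version, originally chosen for its preservation of block-level layout structure (important for Quilter documents, which mix tables, bullet points, and flowing prose), is kept in the same cell as a disabled string literal.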

A boilerplate stripping pass removes recurring noise (page numbers, domain footers) that would otherwise inflate similarity scores between unrelated chunks.

In [3]:
# pdf extraction
"""def extract_pages(pdf_path):
    doc = pymupdf.open(pdf_path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text("text").strip()
        pages.append({
            "source": Path(pdf_path).name,
            "page": i + 1,
            "text": text
        })
    doc.close()
    return pages

# boiler plate removal
def strip_boilerplate(text):
    lines = text.split("\n")
    cleaned = [
        line for line in lines
        if not re.match(r'^\s*\d+\s*$', line) # lone page numbers
        and "quilter.com" not in line.lower()
        and "quilter plc" not in line.lower()
        and len(line.strip()) > 2
    ]
    return "\n".join(cleaned)

all_pages = []
for pdf in sorted(PDF_DIR.glob("*.pdf")):
    pages = extract_pages(pdf)
    for p in pages:
        p["text"] = strip_boilerplate(p["text"])
    all_pages.extend(pages)

print(f"Extracted {len(all_pages)} pages from {len(list(PDF_DIR.glob('*.pdf')))} PDFs")"""

import pdfplumber

# pdf extraction
def extract_pages(pdf_path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            pages.append({
                "source": Path(pdf_path).name,
                "page": i + 1,
                "text": text.strip()
            })
    return pages

# boiler plate removal
def strip_boilerplate(text):
    """Remove common Quilter header/footer noise."""
    lines = text.split("\n")
    cleaned = [
        line for line in lines
        if not re.match(r'^\s*\d+\s*$', line)  # lone page numbers
        and "quilter.com" not in line.lower()
        and "quilter plc" not in line.lower()
        and len(line.strip()) > 2
    ]
    return "\n".join(cleaned)

all_pages = []
for pdf in sorted(PDF_DIR.glob("*.pdf")):
    pages = extract_pages(pdf)
    for p in pages:
        p["text"] = strip_boilerplate(p["text"])
    all_pages.extend(pages)

print(f"Extracted {len(all_pages)} pages from {len(list(PDF_DIR.glob('*.pdf')))} PDFs")



Extracted 482 pages from 89 PDFs


## Chunking - Context Aware Strategy

The PDFs are chunked with a two-pass, context-aware strategy rather than naive fixed-size splitting.

Pass 1 detects section boundaries using structural cues.

Pass 2 applies `RecursiveCharacterTextSplitter` only to sections of 1500 characters or more, splitting them into chunks of up to 1000 characters with a 150-character overlap. It tries paragraph breaks first, then line breaks, then sentence boundaries, then spaces.

Every chunk is prefixed with its section heading to give the embedding model necessary local context. 

Each chunk carries an MD5 content hash serving two purposes: deduplication of boilerplate repeated across documents, and incremental re-indexing. 
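
The notebook uses the hash for deduplication only; incremental re-indexing is not implemented in the cells below. As a rough illustration of how the same hash could support it, here is a minimal sketch that compares freshly produced chunks against the hashes already indexed and works out what to add and what to delete. The helper name `plan_reindex` and the toy chunk texts are hypothetical and are not part of the notebook.

```python
import hashlib


def content_hash(text: str) -> str:
    """Same scheme as make_chunk: MD5 of the encoded chunk text."""
    return hashlib.md5(text.encode()).hexdigest()


def plan_reindex(indexed_hashes: set[str], new_chunks: list[dict]) -> tuple[list[dict], set[str]]:
    """Hypothetical helper: return (chunks to add, hashes to delete)."""
    new_by_hash: dict[str, dict] = {}
    for chunk in new_chunks:
        # setdefault keeps the first chunk per hash, which also deduplicates
        new_by_hash.setdefault(content_hash(chunk["text"]), chunk)
    to_add = [c for h, c in new_by_hash.items() if h not in indexed_hashes]
    to_delete = indexed_hashes - set(new_by_hash)
    return to_add, to_delete


# Toy data: two chunks are already indexed
old_texts = ["Section A\n\nfirst paragraph", "Section B\n\nsecond paragraph"]
indexed = {content_hash(t) for t in old_texts}

fresh = [
    {"text": "Section A\n\nfirst paragraph"},           # unchanged
    {"text": "Section B\n\nsecond paragraph, edited"},  # edited since last run
    {"text": "Section A\n\nfirst paragraph"},           # duplicate of the first
]
to_add, to_delete = plan_reindex(indexed, fresh)
print(f"add {len(to_add)} chunk(s), delete {len(to_delete)} stale hash(es)")
```

With this approach only the edited chunk would need a new embedding call, and the stale hash identifies the vector to remove from the store.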

In [4]:
def extract_version_from_filename(filename):
    match = re.search(r'[_\-\s]v(\d+[\.\d]*)', filename, re.IGNORECASE)
    if match:
        return match.group(1)
    match = re.search(r'[_\-\s](\d{4})', filename)
    if match:
        return match.group(1)
    return "unknown"

def is_heading(line):
    line = line.strip()
    if not line or len(line) > 120:
        return False
    if line.isupper() and len(line) > 3:
        return True
    if re.match(r'^(\d+\.)+\s+[A-Z]', line):
        return True
    return False

def split_into_sections(pages):
    sections = []
    current = {"heading": "Introduction", "text": "", "pages": [], "source": None}
    for page in pages:
        current["source"] = page["source"]
        for line in page["text"].split("\n"):
            if is_heading(line):
                if current["text"].strip():
                    sections.append(current.copy())
                current = {
                    "heading": line.strip(),
                    "text": "",
                    "pages": [page["page"]],
                    "source": page["source"]
                }
            else:
                current["text"] += line + "\n"
                if page["page"] not in current["pages"]:
                    current["pages"].append(page["page"])
    if current["text"].strip():
        sections.append(current)
    return sections

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " "]
)

def make_chunk(text, source, pages, heading, chunk_index=0):
    return {
        "text": text,
        "source": source,
        "pages": pages,
        "heading": heading,
        "chunk_index": chunk_index,
        "content_hash": hashlib.md5(text.encode()).hexdigest(),
        "doc_version": extract_version_from_filename(source)
    }

def chunk_sections(sections):
    chunks = []
    for section in sections:
        full_text = f"{section['heading']}\n\n{section['text']}".strip()
        if len(section["text"]) < 1500:
            chunks.append(make_chunk(full_text, section["source"], section["pages"], section["heading"]))
        else:
            for i, sub in enumerate(splitter.split_text(section["text"])):
                sub_text = f"{section['heading']}\n\n{sub}".strip()
                chunks.append(make_chunk(sub_text, section["source"], section["pages"], section["heading"], i))
    return chunks

# Run it
all_chunks = []
for pdf in sorted(PDF_DIR.glob("*.pdf")):
    pages = extract_pages(pdf)
    for p in pages:
        p["text"] = strip_boilerplate(p["text"])
    sections = split_into_sections(pages)
    all_chunks.extend(chunk_sections(sections))

# Deduplicate
seen = {}
deduped_chunks = []
for chunk in all_chunks:
    if chunk["content_hash"] not in seen:
        seen[chunk["content_hash"]] = True
        deduped_chunks.append(chunk)

print(f"Total chunks before dedup: {len(all_chunks)}")
print(f"Total chunks after dedup:  {len(deduped_chunks)}")
print(f"Avg chunk length: {sum(len(c['text']) for c in deduped_chunks) / len(deduped_chunks):.0f} chars")

Total chunks before dedup: 1452
Total chunks after dedup:  1422
Avg chunk length: 885 chars


## Embedding

Chunks are embedded with OpenAI's `text-embedding-3-small` model and stored in a local, persistent ChromaDB collection.

In [5]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=OPENAI_API_KEY)

texts = [c["text"] for c in deduped_chunks]
metadatas = [
    {
        "source": c["source"],
        "pages": str(c["pages"]),
        "heading": c["heading"],
        "chunk_index": c["chunk_index"],
        "content_hash": c["content_hash"],
        "doc_version": c["doc_version"]
    }
    for c in deduped_chunks
]
vectorstore = Chroma.from_texts(
    texts=texts,
    embedding=embeddings,
    metadatas=metadatas,
    persist_directory=CHROMA_DIR,
    collection_name=COLLECTION_NAME
)
print(f"Indexed {len(texts)} chunks into ChromaDB at {CHROMA_DIR}")

Indexed 1422 chunks into ChromaDB at ./chroma_db


## Retrieval - Hybrid

Dense vector search is combined with sparse BM25. Results are merged using Reciprocal Rank Fusion (RRF), which rewards documents ranked highly by both retrievers. Each retriever fetches 2k candidates and the fused top k are returned; k defaults to 6.
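
To make the fusion step concrete, here is a small sketch of Reciprocal Rank Fusion on toy rankings. It follows the same scoring as `hybrid_retrieve` (0-based ranks, a constant of 60, and a penalty rank for documents missing from one list), but the function name `rrf_fuse`, the document IDs, and the use of the list length as the penalty rank are illustrative assumptions.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 3, c: int = 60) -> list[str]:
    """Fuse two ranked lists with Reciprocal Rank Fusion and return the top k IDs."""
    dense_rank = {doc: i for i, doc in enumerate(dense)}
    sparse_rank = {doc: i for i, doc in enumerate(sparse)}
    missing = max(len(dense), len(sparse))  # penalty rank for a doc absent from one list
    scores = {}
    for doc in set(dense) | set(sparse):
        scores[doc] = (
            1 / (c + dense_rank.get(doc, missing))
            + 1 / (c + sparse_rank.get(doc, missing))
        )
    # Highest score first; the ID is a tie-break so the order is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


dense = ["a", "b", "c", "d"]   # toy dense ranking, best first
sparse = ["c", "a", "e", "f"]  # toy BM25 ranking, best first
print(rrf_fuse(dense, sparse))  # ['a', 'c', 'b']
```

Documents `a` and `c` appear near the top of both lists, so they outrank `b`, which only the dense retriever found.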

In [6]:
# BM25 sparse retriever
tokenized_corpus = [c["text"].lower().split() for c in deduped_chunks]
bm25 = BM25Okapi(tokenized_corpus)


def hybrid_retrieve(query, k=6):
    """Combine dense (ChromaDB) and sparse (BM25) retrieval with RRF."""
    
    # Dense retrieval
    dense_results = vectorstore.similarity_search_with_score(query, k=k*2)
    dense_ranked = {r[0].page_content: i for i, r in enumerate(dense_results)}
    
    # Sparse BM25 retrieval
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_indices = np.argsort(bm25_scores)[::-1][:k*2]
    bm25_ranked = {deduped_chunks[i]["text"]: rank for rank, i in enumerate(bm25_top_indices)}
    
    # Reciprocal Rank Fusion
    all_texts = set(dense_ranked.keys()) | set(bm25_ranked.keys())
    rrf_scores = {}
    for text in all_texts:
        dense_rank = dense_ranked.get(text, k*2)
        bm25_rank = bm25_ranked.get(text, k*2)
        rrf_scores[text] = 1/(60 + dense_rank) + 1/(60 + bm25_rank)
    
    top_texts = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:k]
    
    # Fetch full chunk metadata for top results
    results = []
    text_to_chunk = {c["text"]: c for c in deduped_chunks}
    for text in top_texts:
        if text in text_to_chunk:
            results.append(text_to_chunk[text])
    
    return results



print("Hybrid retriever ready.")

Hybrid retriever ready.


## Answer Generation & Guardrail

The prompt is strict and the temperature is set to 0. The Contact Centre fallback can be triggered either by the model's first response or by the `check_faithfulness` guardrail.
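
The two fallback paths can be sketched in isolation, without calling the API. The version below is one possible variant, not the notebook's implementation: it assumes the judge's reply may wrap the JSON in extra text, it treats an unparseable verdict as unfaithful (fail closed), and it skips the judge when the draft is already the fallback sentence. The names `parse_verdict` and `apply_guardrail` are hypothetical.

```python
import json

FALLBACK = "Please reach out to the Contact Centre."


def parse_verdict(raw: str) -> dict:
    """Extract the JSON object from a judge reply, tolerating surrounding text."""
    start, end = raw.find("{"), raw.rfind("}")
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        # Assumption: an unparseable verdict counts as unfaithful
        return {"faithful": False, "reason": "Could not parse faithfulness check"}


def apply_guardrail(draft: str, raw_verdict: str) -> str:
    """Return the draft answer only if it is not a refusal and was judged faithful."""
    if draft.strip() == FALLBACK:
        return FALLBACK  # path 1: the model itself refused
    verdict = parse_verdict(raw_verdict)
    if verdict.get("faithful") is True:
        return draft
    return FALLBACK      # path 2: the guardrail rejected the draft


print(apply_guardrail("Complete the form [Source: a.pdf, p.1]",
                      'Verdict: {"faithful": true, "reason": "ok"}'))
print(apply_guardrail("Complete the form [Source: a.pdf, p.1]", "not json"))
```

Skipping the judge for refusals avoids the situation visible in the single-question test further down, where the fallback sentence itself is reported as unfaithful.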

In [7]:
# answer generation with guardrails

from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY)

SYSTEM_PROMPT = """You are an assistant that helps Quilter financial advisers answer operational questions.

STRICT RULES:
1. Answer ONLY using the context provided below. Do not use any outside knowledge.
2. If the context does not contain enough information to answer the question, respond with exactly:
   "Please reach out to the Contact Centre."
3. Always cite your sources using the format [Source: filename, p.X] at the end of each point.
4. Do not give financial advice. Only explain processes and operational steps.
5. Be concise and structured. Use bullet points for multi-step processes.
"""

def check_faithfulness(answer, context_texts):
    """Ask GPT to verify the answer is grounded in the context."""
    context_combined = "\n\n".join(context_texts[:4])
    check_prompt = f"""Given this context:
{context_combined}

And this answer:
{answer}

Does every factual claim in the answer appear in the context? 
Reply with JSON only: {{"faithful": true/false, "reason": "brief explanation"}}"""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper model for this check
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0
    )
    try:
        return json.loads(response.choices[0].message.content)
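    except (json.JSONDecodeError, TypeError):
        # Unparseable or empty reply: the check fails open and the answer is kept
        return {"faithful": True, "reason": "Could not parse faithfulness check"}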


def answer_query(query, k=6, check_grounding=True):
    # Retrieve relevant chunks
    retrieved = hybrid_retrieve(query, k=k)
    
    if not retrieved:
        return {
            "answer": "Please reach out to the Contact Centre.",
            "sources": [],
            "faithful": None,
            "chunks_used": 0
        }
    
    # Build context string
    context_parts = []
    for chunk in retrieved:
        source_label = f"[Source: {chunk['source']}, p.{chunk['pages']}]"
        context_parts.append(f"{source_label}\n{chunk['text']}")
    context = "\n\n---\n\n".join(context_parts)
    
    # Generate answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        temperature=0
    )
    answer = response.choices[0].message.content
    
    # Faithfulness check
    faithfulness = None
    if check_grounding:
        faithfulness = check_faithfulness(answer, [c["text"] for c in retrieved])
        if not faithfulness["faithful"]:
            answer = "Please reach out to the Contact Centre."
    
    return {
        "answer": answer,
        "sources": [{"source": c["source"], "pages": c["pages"], "heading": c["heading"]} for c in retrieved],
        "faithful": faithfulness,
        "chunks_used": len(retrieved)
    }

    
print("Answer generation ready.")

Answer generation ready.


## Local FastAPI Endpoint

Two routes are provided: /health for server status and /ask for query submission, with configurable parameters for retrieval depth and guardrail toggling.

In [8]:
# FastAPI app

import nest_asyncio
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import threading

nest_asyncio.apply()  # allows FastAPI to run inside a Jupyter notebook

app = FastAPI(title="Quilter Advisor Assistant")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"]
)

class QueryRequest(BaseModel):
    query: str
    k: int = 6
    check_grounding: bool = True


class QueryResponse(BaseModel):
    answer: str
    sources: list
    faithful: dict | None
    chunks_used: int

@app.get("/health")
def health():
    return {"status": "ok", "chunks_indexed": len(deduped_chunks)}

@app.post("/ask", response_model=QueryResponse)
def ask(request: QueryRequest):
    result = answer_query(request.query, k=request.k, check_grounding=request.check_grounding)
    return result

# Run in background thread so notebook stays interactive
def run_server():
    uvicorn.run(app, host="0.0.0.0", port=8000, log_level="warning")

thread = threading.Thread(target=run_server, daemon=True)
thread.start()

print("FastAPI running at http://localhost:8000")
print("Docs at http://localhost:8000/docs")

FastAPI running at http://localhost:8000
Docs at http://localhost:8000/docs


In [9]:
import time
time.sleep(2)  # give the thread a moment to start

import requests
try:
    r = requests.get("http://localhost:8000/health", timeout=3)
    print("Server is up:", r.json())
except Exception as e:
    print("Server not running:", e)
    print("\nCheck that Cell 8 ran without errors.")

Server is up: {'status': 'ok', 'chunks_indexed': 1422}


Test a single question and answer...

In [10]:
import requests


# Test the health endpoint
print(requests.get("http://localhost:8000/health").json())

# Test a query
response = requests.post("http://localhost:8000/ask", json={
    "query": "What are the requirements to complete an ISA transfer?",
    "k": 6,
    "check_grounding": True
})

result = response.json()
print("\n=== ANSWER ===")
print(result["answer"])
print("\n=== SOURCES ===")
for s in result["sources"]:
    print(f"  - {s['source']} | {s['heading']} | pp.{s['pages']}")
print("\n=== GROUNDING CHECK ===")
print(result["faithful"])




{'status': 'ok', 'chunks_indexed': 1422}

=== ANSWER ===
Please reach out to the Contact Centre.

=== SOURCES ===
  - 18152-isa-aps-options--for-deaths-before-6-april-2018.pdf | Introduction | pp.[1]
  - accepting_documentation_via_email_and_prompt.pdf | Introduction | pp.[1, 2, 3]
  - 18051-isa-aps-options-after-6-april.pdf | Introduction | pp.[1]
  - 18051-isa-aps-options-after-6-april.pdf | 2. If the deceased had an ISA with another ISA manager, the options are: | pp.[3]
  - 23625-release-of-information-request.pdf | Introduction | pp.[1]
  - 18051-isa-aps-options-after-6-april.pdf | YES NO | pp.[1]

=== GROUNDING CHECK ===
{'faithful': False, 'reason': 'The answer does not provide any specific information or context related to the Additional Permitted Subscription (APS) or the eligibility criteria mentioned in the provided context.'}


# Offline Evaluation

## Annotated Evaluation Dataset

The dataset contains 11 manually created question-answer pairs, each annotated with its source document and the page where the answer appears, plus 5 out-of-scope questions used specifically to test refusal behaviour.

In [11]:
EVAL_SET = [
    {
        "question": "What is Quilter's Absolute Trust?",
        "expected_answer": "A simple IHT solution where the client does not require access to the capital, knows who they want to leave their wealth to, and requires no future flexibility.",
        "source_pdf": "20881-understanding-our-range-of-trusts.pdf",
        "source_pages": [1],
        "type": "in-scope"
    },
    {
        "question": "When is Life Fund Tax Charge Taken?",
        "expected_answer": "The charge is taken at numerous points through the year.",
        "source_pdf": "7910-taxation-for-quilter-life-funds.pdf",
        "source_pages": [4],
        "type": "in-scope"
    },
    {
        "question": "Who is the Quilter Smoothed Balanced Fund (Standard Life) a suitable choice for?",
        "expected_answer": "The funds may be suitable for an investor who is approaching or is in retirement, wants to reduce day-to-day fluctuations, is looking to potentially grow their investment, is willing to accept some risk, intends to invest for at least five years, wants to take an income, and has a financial adviser.",
        "source_pdf": "kiid-gb00bt9lzb27-en-gb.pdf",
        "source_pages": [2],
        "type": "in-scope"
    },
    {
        "question": "What are investment pathways?",
        "expected_answer": "Instead of having to choose an investment for your drawdown pot, you choose from four options that closely match what you would like to do with your money in the next 5 years.",
        "source_pdf": "20993-what-you-need-to-know-about-investment-pathways.pdf",
        "source_pages": [3],
        "type": "in-scope"
    },
    {
        "question": "What are share identification rules?",
        "expected_answer": "Rules that ended bed and breakfasting by matching disposed units first with units acquired the same day, then units acquired in the following 30 days, then units in the Section 104 holding.",
        "source_pdf": "20720-cgt-quick-reference-guide-3-share-identification-rules.pdf",
        "source_pages": [1],
        "type": "in-scope"
    },
    {
        "question": "In the Legal Framework what is point 6.1.2?",
        "expected_answer": "The Intermediary will be responsible for ensuring that only permitted individuals access and use the Services, and will be liable for any acts or omissions resulting from use of User Access by any of its Users.",
        "source_pdf": "7161_legal_framework.pdf",
        "source_pages": [3],
        "type": "in-scope"
    },
    {
        "question": "In the Legal Framework, what is the policy on Third Party suppliers?",
        "expected_answer": "Third party providers may require an Intermediary or User to agree to additional terms for use of their software or services, without prejudice to the obligations of the Parties under the Agreement.",
        "source_pdf": "7161_legal_framework.pdf",
        "source_pages": [5],
        "type": "in-scope"
    },
    {
        "question": "What is an Additional Permitted Subscription (APS)?",
        "expected_answer": "When an ISA investor dies on or after 3 December 2014, a surviving spouse is entitled to invest into an ISA over and above the annual ISA allowance, known as an Additional Permitted Subscription or APS.",
        "source_pdf": "18152-isa-aps-options--for-deaths-before-6-april-2018.pdf",
        "source_pages": [1],
        "type": "in-scope"
    },
    {
        "question": "How can you complete the Intermediary Resignation Form?",
        "expected_answer": "Either electronically by saving and opening in Adobe Acrobat to complete editable fields then signing, or by hand by printing and completing in block capitals using blue or black ink.",
        "source_pdf": "19502-intermediary-registration-form.pdf",
        "source_pages": [1],
        "type": "in-scope"
    },
    {
        "question": "What are the objectives of the WealthSelect Sustainable Portfolios?",
        "expected_answer": "To achieve capital growth over five or more years whilst supporting sustainable solutions to environmental and social challenges aligned with UN Sustainable Development Goals, managing ESG risks and maintaining a smaller carbon footprint than the MSCI ACWI reference index.",
        "source_pdf": "11541_wealthselect_due_dilligence.pdf",
        "source_pages": [5],
        "type": "in-scope"
    },
    {
        "question": "What is reason number 1 for using onshore bonds?",
        "expected_answer": "Tax deferral and control — clients are only assessable for tax when a chargeable event occurs, such as withdrawals over the 5% allowance, meaning they control when they report and pay tax.",
        "source_pdf": "23416_six_reasons_why.pdf",
        "source_pages": [1],
        "type": "in-scope"
    },
    {
        "question": "How can I make money from Bitcoin?",
        "expected_answer": None,
        "source_pdf": None,
        "source_pages": None,
        "type": "out-of-scope"
    },
    {
        "question": "What is the weather doing?",
        "expected_answer": None,
        "source_pdf": None,
        "source_pages": None,
        "type": "out-of-scope"
    },
    {
        "question": "Who is Kai Glahome?",
        "expected_answer": None,
        "source_pdf": None,
        "source_pages": None,
        "type": "out-of-scope"
    },
    {
        "question": "How do I increase my expected returns?",
        "expected_answer": None,
        "source_pdf": None,
        "source_pages": None,
        "type": "out-of-scope"
    },
    {
        "question": "Where is London?",
        "expected_answer": None,
        "source_pdf": None,
        "source_pages": None,
        "type": "out-of-scope"
    }
]

## Running Offline Evaluation

Offline Evaluation Metrics

The metrics are grouped into retrieval, generation, and system categories; the reasoning behind this grouping is discussed in the accompanying documentation.

**Retrieval** (first 3 derived from Recall@6)

- **Source PDF Recall@6** — For each in-scope question, checks whether the correct source document appeared within the top 6 retrieved chunks. A score of 1.0 indicates perfect retrieval coverage.
- **Page Recall@6** — For each in-scope question, checks whether the correct page appeared within the top 6 retrieved chunks. A score of 1.0 indicates perfect page-level retrieval coverage.
- **Avg Source Rank** — Where the correct document was retrieved, records its position in the results list. Lower is better — a rank of 1 means the correct document was the top result.
- **Avg Redundancy Score** — Measures the proportion of retrieved chunks that came from distinct source documents. A score of 1.0 means all 6 chunks came from different documents; a lower score indicates multiple chunks from the same source.

**Generation**

- **Avg Correctness Score (LLM Judge)** — GPT-4o-mini judges each generated answer against the hand-crafted expected answer on a 1-3 scale: 1 = incorrect, 2 = partially correct, 3 = correct and complete.
- **Avg Groundedness Score (LLM Judge)** — A separate LLM judge assesses whether the generated answer is grounded in the retrieved sources rather than outside knowledge, on a 1-3 scale.

**System**

- **Faithful Answers (LLM Judge)** — Count of in-scope answers that passed the inference-time faithfulness check. Failed answers are suppressed and replaced with the Contact Centre fallback.
- **Refusal Accuracy** — Proportion of out-of-scope questions correctly deflected to the Contact Centre fallback.
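
As a worked example of the retrieval metrics above, the sketch below computes the source hit, source rank, and redundancy score for a single question from a toy list of retrieved file names. The function name `retrieval_metrics` and the file names are made up for illustration; the arithmetic follows the definitions in the list.

```python
def retrieval_metrics(retrieved_pdfs: list[str], expected_pdf: str) -> dict:
    """Per-question retrieval metrics for one ranked list of source file names."""
    hit = expected_pdf in retrieved_pdfs
    # Rank is 1-based and refers to the first chunk from the expected document
    rank = retrieved_pdfs.index(expected_pdf) + 1 if hit else None
    # Share of retrieved chunks that come from distinct documents
    redundancy = (
        round(len(set(retrieved_pdfs)) / len(retrieved_pdfs), 2)
        if retrieved_pdfs else None
    )
    return {"source_hit": hit, "source_rank": rank, "redundancy_score": redundancy}


# Toy result list for k=6: three distinct documents across six chunks
toy = ["a.pdf", "b.pdf", "a.pdf", "c.pdf", "a.pdf", "b.pdf"]
print(retrieval_metrics(toy, "b.pdf"))
# {'source_hit': True, 'source_rank': 2, 'redundancy_score': 0.5}
```

Averaging `source_hit` over the in-scope questions gives Source PDF Recall@6, and averaging `source_rank` over the hits gives Avg Source Rank.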

In [12]:
import requests
import json
from typing import Optional

def run_offline_evaluation(eval_set):
    results = []
    
    for item in eval_set:
        response = requests.post("http://localhost:8000/ask", json={
            "query": item["question"],
            "k": 6,
            "check_grounding": True
        })
        result = response.json()
        
        answer = result["answer"]
        sources = result["sources"]
        faithful = result["faithful"]
        retrieved_pdfs = [s["source"] for s in sources]
        # Pages normally arrive as lists; parse stringified lists with json.loads rather than eval
        retrieved_pages = [p for s in sources for p in (json.loads(s["pages"]) if isinstance(s["pages"], str) else s["pages"])]
        
        # --- Retrieval check (in-scope only) ---
        source_hit = None
        source_rank = None
        page_hit = None
        if item["type"] == "in-scope":
            source_hit = item["source_pdf"] in retrieved_pdfs
            if source_hit:
                source_rank = next(i+1 for i, pdf in enumerate(retrieved_pdfs) if pdf == item["source_pdf"])
            # Note: page numbers are matched across all retrieved PDFs, not only the expected source
            page_hit = any(p in retrieved_pages for p in item["source_pages"])

        # --- Retrieval redundancy (in-scope only) ---
        redundancy_score = None
        if item["type"] == "in-scope":
            distinct_pdfs = len(set(retrieved_pdfs))
            redundancy_score = round(distinct_pdfs / len(retrieved_pdfs), 2)

        # --- Refusal check (out-of-scope only) ---
        correctly_refused = None
        if item["type"] == "out-of-scope":
            correctly_refused = "contact centre" in answer.lower()

        # --- LLM-as-judge answer correctness (in-scope only) ---
        llm_score = None
        llm_reasoning = None
        if item["type"] == "in-scope":
            judge_prompt = f"""You are evaluating an AI assistant's answer against a known correct answer.

Question: {item["question"]}
Expected Answer: {item["expected_answer"]}
Generated Answer: {answer}

Score the generated answer on a scale of 1-3:
1 = Incorrect or missing key information
2 = Partially correct, captures some but not all key points
3 = Correct and complete

Respond with JSON only: {{"score": <1-3>, "reasoning": "<brief explanation>"}}"""

            judge_response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": judge_prompt}],
                temperature=0
            )
            try:
                judge_result = json.loads(judge_response.choices[0].message.content)
                llm_score = judge_result["score"]
                llm_reasoning = judge_result["reasoning"]
            except (json.JSONDecodeError, KeyError, TypeError):
                llm_score = None
                llm_reasoning = "Parse error"

        # --- LLM-as-judge groundedness (in-scope only) ---
        groundedness_score = None
        groundedness_reasoning = None
        if item["type"] == "in-scope":
            groundedness_prompt = f"""You are evaluating whether an AI assistant's answer is grounded in the retrieved documents.

Retrieved Sources: {[s['source'] for s in sources]}
Generated Answer: {answer}

Does the answer contain only information that could plausibly come from the retrieved sources, or does it appear to include invented or outside knowledge?

Score on a scale of 1-3:
1 = Answer contains claims not supported by retrieved sources
2 = Answer is mostly grounded but contains some unsupported claims
3 = Answer is fully grounded in the retrieved sources

Respond with JSON only: {{"score": <1-3>, "reasoning": "<brief explanation>"}}"""

            groundedness_response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": groundedness_prompt}],
                temperature=0
            )
            try:
                groundedness_result = json.loads(groundedness_response.choices[0].message.content)
                groundedness_score = groundedness_result["score"]
                groundedness_reasoning = groundedness_result["reasoning"]
            except (json.JSONDecodeError, KeyError, TypeError):
                groundedness_score = None
                groundedness_reasoning = "Parse error"

        results.append({
            "question": item["question"],
            "type": item["type"],
            "answer_preview": answer[:200],
            "source_hit": source_hit,
            "source_rank": source_rank,
            "page_hit": page_hit,
            "redundancy_score": redundancy_score,
            "correctly_refused": correctly_refused,
            "llm_score": llm_score,
            "llm_reasoning": llm_reasoning,
            "groundedness_score": groundedness_score,
            "groundedness_reasoning": groundedness_reasoning,
            "faithful": faithful
        })

    return results



In [13]:
# Run it
results = run_offline_evaluation(EVAL_SET)


## Results and Interpretation

In [14]:
import pandas as pd

def print_evaluation_report(results):
    in_scope = [r for r in results if r["type"] == "in-scope"]
    out_scope = [r for r in results if r["type"] == "out-of-scope"]

    # Compute metrics
    recall = sum(1 for r in in_scope if r["source_hit"]) / len(in_scope)
    avg_rank = sum(r["source_rank"] for r in in_scope if r["source_rank"]) / max(sum(1 for r in in_scope if r["source_rank"]), 1)
    page_recall = sum(1 for r in in_scope if r["page_hit"]) / len(in_scope)
    avg_redundancy = sum(r["redundancy_score"] for r in in_scope if r["redundancy_score"] is not None) / len(in_scope)
    scored = [r for r in in_scope if r["llm_score"]]
    avg_llm_score = sum(r["llm_score"] for r in scored) / max(len(scored), 1)
    scored_grounded = [r for r in in_scope if r["groundedness_score"]]
    avg_groundedness = sum(r["groundedness_score"] for r in scored_grounded) / max(len(scored_grounded), 1)
    faithful_count = sum(1 for r in in_scope if r["faithful"] and r["faithful"].get("faithful"))
    refusal_accuracy = sum(1 for r in out_scope if r["correctly_refused"]) / len(out_scope)

    # Summary table
    print("=" * 60)
    print("OFFLINE EVALUATION REPORT")
    print("=" * 60)

    summary_data = {
        "Metric": [
            "Source PDF Recall@6",
            "Page Recall@6",
            "Avg Source Rank (lower = better)",
            "Avg Redundancy Score (1.0 = fully diverse)",
            "Avg Correctness Score (LLM Judge) /3",
            "Avg Groundedness Score (LLM Judge) /3",
            "Faithful Answers",
            "Refusal Accuracy (out-of-scope)"
        ],
        "Value": [
            f"{recall:.2f}",
            f"{page_recall:.2f}",
            f"{avg_rank:.1f}",
            f"{avg_redundancy:.2f}",
            f"{avg_llm_score:.2f}",
            f"{avg_groundedness:.2f}",
            f"{faithful_count}/{len(in_scope)}",
            f"{refusal_accuracy:.2f}"
        ]
    }
    print(pd.DataFrame(summary_data).to_string(index=False))

    # Per question breakdown
    print("\n" + "=" * 60)
    print("PER QUESTION BREAKDOWN")
    print("=" * 60)

    for r in results:
        print(f"\n{'─' * 60}")
        print(f"Q:  {r['question'][:80]}")
        print(f"    Type: {r['type'].upper()}")
        if r["type"] == "in-scope":
            print(f"    Source Hit:   {'✓' if r['source_hit'] else '✗'} (rank {r['source_rank']})")
            print(f"    Page Hit:     {'✓' if r['page_hit'] else '✗'}")
            print(f"    Redundancy:   {r['redundancy_score']}")
            print(f"    Correctness:  {r['llm_score']}/3 — {r['llm_reasoning']}")
            print(f"    Groundedness: {r['groundedness_score']}/3 — {r['groundedness_reasoning']}")
            print(f"    Faithful:     {'✓' if r['faithful'] and r['faithful'].get('faithful') else '✗'}")
        else:
            print(f"    Refused:      {'✓' if r['correctly_refused'] else '✗'}")
        print(f"    Preview:      {r['answer_preview'][:150]}")

In [15]:
print_evaluation_report(results)

OFFLINE EVALUATION REPORT
                                    Metric Value
                       Source PDF Recall@6  0.91
                             Page Recall@6  0.91
          Avg Source Rank (lower = better)   1.4
Avg Redundancy Score (1.0 = fully diverse)  0.47
      Avg Correctness Score (LLM Judge) /3  1.82
     Avg Groundedness Score (LLM Judge) /3  2.45
                          Faithful Answers  7/11
           Refusal Accuracy (out-of-scope)  1.00

PER QUESTION BREAKDOWN

────────────────────────────────────────────────────────────
Q:  What is Quilter's Absolute Trust?
    Type: IN-SCOPE
    Source Hit:   ✗ (rank None)
    Page Hit:     ✓
    Redundancy:   1.0
    Correctness:  1/3 — The generated answer does not provide any information about Quilter's Absolute Trust and instead directs the user to contact a center, missing all key points of the expected answer.
    Groundedness: 1/3 — The answer 'Please reach out to the Contact Centre' does not appear to be supported by