# Part 4: The Showdown (Evaluation Arena)

**Objective:** Compare "The Librarian" (RAG System) vs. "The Intern" (Fine-Tuned Model) on the Golden Test Set.

**Metrics:**
1.  **ROUGE-L**: Measures text overlap (Precision, Recall, F1) against Ground Truth.
2.  **LLM-as-a-Judge**: Uses a superior model (e.g., GPT-4o) to score answer quality (1-5) and reasoning.
3.  **Latency**: Time taken to generate the answer.
4.  **Cost**: Estimated cost per 1k queries.

**Note:** Since the fine-tuning step was skipped/mocked, "The Intern" evaluation will use the **Base Model** (Llama-3-8B) as a proxy to demonstrate the pipeline.

In [1]:
# 1. Install Dependencies (if needed)
# !pip install -q rouge_score weave weaviate-client langchain langchain-community langchain-huggingface sentence-transformers

In [2]:
# 2. Imports & Configuration
import os
import json
import time
import yaml
import pandas as pd
import weaviate
from rouge_score import rouge_scorer

# Add src to path
import sys
sys.path.append(os.path.abspath("../src"))

from services.llm_services import get_llm
from utils.cost_tracker import get_token_count, PRICING

# Load Config (safe)
config = {}
try:
    with open("../src/config/config.yaml", "r") as f:
        config = yaml.safe_load(f) or {}
except FileNotFoundError:
    print("Warning: ../src/config/config.yaml not found. Using defaults.")
except Exception as e:
    print(f"Warning: Error loading config: {e}. Using defaults.")

# Paths with fallbacks to avoid KeyError if 'data' is missing in config
data_cfg = config.get("data", {})
GOLDEN_SET_PATH = data_cfg.get("golden_test_set_path", '../data/processed/golden_test_set.jsonl')
RESULTS_PATH = data_cfg.get("eval_results_path", '../data/results/rag_evaluation_results.json')

print(f"Config Loaded (partial). Testing on: {GOLDEN_SET_PATH}")

Config Loaded (partial). Testing on: ../data/processed/golden_test_set.jsonl


## 3. Metric Functions

In [3]:
def calculate_rouge(reference, candidate):
    """Calculates ROUGE-L score."""
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return scores['rougeL'].fmeasure

def calculate_cost(input_text, output_text, model_name):
    """Estimates cost based on token counts and PRICING."""
    in_tokens = get_token_count(input_text)
    out_tokens = get_token_count(output_text)
    
    # Normalize model name key (handle variations)
    pricing = PRICING.get(model_name, PRICING.get("openai/gpt-4o-mini"))
    
    cost = (in_tokens / 1_000_000) * pricing["input"] + (out_tokens / 1_000_000) * pricing["output"]
    return cost

import re
import time

def llm_judge(question, ground_truth, answer, judge_model=None):
    """Uses an LLM to grade the answer 1-5."""
    if judge_model is None:
        # Initialize a strong judge model (e.g. GPT-4o or similar)
        try:
            # Use 'google' provider if 'gemini' (or vice-versa) to ensure fallback options
            judge_config = config.copy()
            judge_config["llm_model"] = "gpt-4o" # Force strong model if available
            judge_model = get_llm(judge_config)
        except Exception:
            print("Warning: Judge model init failed, using default config.")
            judge_model = get_llm(config)

    prompt_template = """
    You are an impartial judge evaluating a financial analyst's answer.
    
    Question: {question}
    Ground Truth: {ground_truth}
    Student Answer: {answer}
    
    Evaluate the Student Answer based on accuracy, completeness, and tone compared to the Ground Truth.
    Output ONLY a valid JSON object with two keys:
    - "score": an integer 1-5 (1=bad, 5=excellent)
    - "reasoning": a brief explanation.
    Do NOT output markdown formatting, just the raw JSON.
    """
    
    prompt = prompt_template.format(question=question, ground_truth=ground_truth, answer=answer)
    
    # Initialize content OUTSIDE the try/loop to avoid UnboundLocalError
    content = ""
    max_retries = 3
    
    for attempt in range(max_retries):
        try:
            response = judge_model.invoke(prompt)
            content = response.content.strip()
            
            # Enhanced JSON extraction for Gemini/other models
            match = re.search(r'\{.*\}', content, re.DOTALL)
            if match:
                content = match.group(0)
            
            content = content.replace("```json", "").replace("```", "").strip()
            
            result = json.loads(content)
            return result.get("score", 0), result.get("reasoning", "Parse Error")
            
        except Exception as e:
            error_str = str(e)
            is_rate_limit = "429" in error_str or "RESOURCE_EXHAUSTED" in error_str
            
            if is_rate_limit:
                wait_time = 20 * (attempt + 1) # Progressive backoff: 20s, 40s, 60s
                print(f"Judge Rate Limit (429). Waiting {wait_time}s... (Attempt {attempt+1}/{max_retries})")
                time.sleep(wait_time)
                continue  # Retry
            else:
                # Safeguard print against uninitialized content
                debug_content = content[:100] if content else "<Empty/Failed>"
                print(f"Judge Error: {e}\nContent: {debug_content}...")
                return 0, "Error"
                
    return 0, "Error - Rate Limit Exceeded"


## 4. System 1: "The Librarian" (RAG Setup)

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import CrossEncoder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# --- 1. Connect to Weaviate ---
client = None

# 1. Try Local Docker (Preferred for Windows)
print("1. Trying Local Docker...")
try:
    client = weaviate.connect_to_local()
    if client.is_ready():
        print(" Connected to Local Docker!")
except Exception as e:
    print(f" Local connection failed: {e}")
    client = None

# 2. Try Cloud (WCS) if Local failed
if not client:
    print("2. Trying Weaviate Cloud...")
    try:
        # Fix: Safely get config, fallback to env vars
        vectordb_cfg = config.get("vector_db", config.get("vectordb", {}))
        wcs_url = vectordb_cfg.get("wcs_url") or os.environ.get("WEAVIATE_URL")
        wcs_api_key = vectordb_cfg.get("wcs_api_key") or os.environ.get("WEAVIATE_API_KEY")
        
        if wcs_url and wcs_api_key:
            client = weaviate.connect_to_wcs(
                cluster_url=wcs_url,
                auth_credentials=weaviate.auth.AuthApiKey(wcs_api_key)
            )
            print(" Connected to Weaviate Cloud!")
        else:
            print(" No Cloud credentials found.")
    except Exception as e:
        print(f" Cloud connection failed: {e}")

if not client or not client.is_ready():
    print("CRITICAL: Weaviate connection failed.")
else:
    print(f"Weaviate Ready: {client.is_ready()}")

# --- 2. Embeddings & Reranker ---
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# --- 3. Retrieval Functions ---
def hybrid_search(query, limit=20):
    if not client:
        raise RuntimeError("Weaviate client is not connected.")
    
    # FIX: Get collection name from config
    vectordb_cfg = config.get("vector_db", config.get("vectordb", {}))
    collection_name = vectordb_cfg.get("collection_name", "FinancialReport")
    
    collection = client.collections.get(collection_name)
    query_vector = embedding_model.embed_query(query)
    response = collection.query.hybrid(
        query=query,
        vector=query_vector,
        alpha=0.5,
        limit=limit,
        return_metadata=weaviate.classes.query.MetadataQuery(score=True)
    )
    results = []
    for o in response.objects:
        res = o.properties
        res['score'] = o.metadata.score
        results.append(res)
    return results

def rerank_results(query, retrieved_docs, top_k=5):
    if not retrieved_docs: return []
    pairs = [[query, doc['text']] for doc in retrieved_docs]
    scores = reranker.predict(pairs)
    for i, doc in enumerate(retrieved_docs):
        doc['rerank_score'] = float(scores[i])
    return sorted(retrieved_docs, key=lambda x: x['rerank_score'], reverse=True)[:top_k]

def format_docs(docs):
    return "\n\n".join([f"[Source: Page {d.get('page_number', '?')}] {d['text']}" for d in docs])

# --- 4. RAG Chain ---
rag_template = """
You are a specialized financial analyst assistant.
Use the following context to answer the user's question accurately.
If the answer is not in the context, say "I don't have enough information."
Keep answers professional and concise.

Context:
{context}

Question: {question}
Answer:
"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)
rag_llm = get_llm(config) # Uses default llm_model from config
rag_chain = rag_prompt | rag_llm | StrOutputParser()



1. Trying Local Docker...
 Connected to Local Docker!
Weaviate Ready: True




## 5. System 2: "The Intern" (Mock Fine-Tuned Model)
Since we didn't perform the physical fine-tuning, we will use the **Base Model** directly as a proxy. Realistically, this would be the `PeftModel` loaded from disk.

In [5]:
intern_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a financial analyst specializing in the 2024 Annual Report. Answer strictly based on your internal knowledge and the following question.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

intern_prompt = ChatPromptTemplate.from_template(intern_template)
intern_llm = get_llm(config)
intern_chain = intern_prompt | intern_llm | StrOutputParser()

## 6. Run Evaluation Loop

In [6]:
# Load Test Set
test_set = []
with open(GOLDEN_SET_PATH, 'r') as f:
    for line in f:
        test_set.append(json.loads(line))

print(f"Loaded {len(test_set)} test questions.")

results = []

# Define Judge Model once
judge_config = config.copy()
judge_config["llm_model"] = "gpt-4o-mini" 
judge_config["llm_provider"] = "openrouter" # Ensure provider matches

try:
    judge_bot = get_llm(judge_config)
except:
    print("Warning: Judge model init failed, using default config.")
    judge_bot = get_llm(config)

for i, sample in enumerate(test_set[:15]): # Limit to 15 for speed/cost
    q = sample['question']
    gt = sample['answer']
    
    print(f"Evaluating Q{i+1}/{len(test_set)}...")
    
    # --- 1. Evaluator: Librarian (RAG) ---
    start = time.perf_counter()
    # Retrieval
    retrieved = hybrid_search(q)
    reranked = rerank_results(q, retrieved)
    context_str = format_docs(reranked)
    # Generation
    rag_response = rag_chain.invoke({"context": context_str, "question": q})
    rag_response = rag_chain.invoke({"context": context_str, "question": q})
    time.sleep(10) # 10s delay (15 RPM limit -> ~16s total per q)
    rag_time = time.perf_counter() - start
    
    # --- 2. Evaluator: Intern (Mock) ---
    start = time.perf_counter()
    intern_response = intern_chain.invoke({"question": q})
    intern_time = time.perf_counter() - start
    
    # --- 3. Scoring ---
    
    # ROUGE
    rag_rouge = calculate_rouge(gt, rag_response)
    intern_rouge = calculate_rouge(gt, intern_response)
    
    # Judge
    rag_score, rag_reason = llm_judge(q, gt, rag_response, judge_bot)
    intern_score, intern_reason = llm_judge(q, gt, intern_response, judge_bot)
    
    # Cost (Est.)
    rag_cost = calculate_cost(context_str + q, rag_response, config["llm_model"])
    intern_cost = calculate_cost(q, intern_response, config["llm_model"])
    
    results.append({
        "question": q,
        "ground_truth": gt,
        "librarian_answer": rag_response,
        "intern_answer": intern_response,
        "librarian_time": rag_time,
        "intern_time": intern_time,
        "librarian_rouge": rag_rouge,
        "intern_rouge": intern_rouge,
        "librarian_score": rag_score,
        "intern_score": intern_score,
        "librarian_cost": rag_cost,
        "intern_cost": intern_cost,
        "librarian_judge_reason": rag_reason,
        "intern_judge_reason": intern_reason
    })
    time.sleep(10) # 10s delay (15 RPM limit -> ~16s total per q)

# Save Results
df_res = pd.DataFrame(results)
os.makedirs("../data/results", exist_ok=True)
df_res.to_json(RESULTS_PATH, orient="records", indent=2)
print(f"Evaluation Complete. Results saved to {RESULTS_PATH}")

Loaded 600 test questions.
Evaluating Q1/600...
Evaluating Q2/600...
Evaluating Q3/600...
Evaluating Q4/600...
Evaluating Q5/600...
Evaluating Q6/600...
Evaluating Q7/600...
Evaluating Q8/600...
Evaluating Q9/600...
Evaluating Q10/600...
Evaluating Q11/600...
Evaluating Q12/600...
Evaluating Q13/600...
Evaluating Q14/600...
Evaluating Q15/600...
Evaluation Complete. Results saved to ../data/results/rag_evaluation_results.json


## 7. Results & Analysis

In [7]:
# Summary Table
summary = df_res[[
    "librarian_time", "intern_time", 
    "librarian_rouge", "intern_rouge", 
    "librarian_score", "intern_score",
    "librarian_cost", "intern_cost"
]].mean()

print("--- Average Metrics ---")
print(summary)

# Detailed View
display(df_res[["question", "librarian_score", "intern_score", "librarian_rouge", "intern_rouge"]].head())

--- Average Metrics ---
librarian_time     13.757364
intern_time         5.000468
librarian_rouge     0.400217
intern_rouge        0.164422
librarian_score     4.133333
intern_score        4.000000
librarian_cost      0.000196
intern_cost         0.000139
dtype: float64


Unnamed: 0,question,librarian_score,intern_score,librarian_rouge,intern_rouge
0,How does the tone of the text reflect the comp...,1,3,0.0,0.162963
1,What types of legislative proposals are curren...,3,5,0.181818,0.013746
2,What strategic goal can be inferred from the s...,4,4,0.181818,0.023529
3,What tone is used when discussing the potentia...,5,5,0.517241,0.236559
4,What could be the potential consequences for t...,5,5,0.598425,0.123324


## 8. Business Cost Analysis
**Scenario**: 500 Daily Users, 10 Queries each = 5,000 queries/day.

In [8]:
DAILY_QUERIES = 5000

rag_daily_cost = summary["librarian_cost"] * DAILY_QUERIES
intern_daily_cost = summary["intern_cost"] * DAILY_QUERIES

print(f"--- ROI / Cost Analysis (Per Day) ---")
print(f"The Librarian (RAG) Cost: ${rag_daily_cost:.2f}")
print(f"The Intern (Fine-Tuned) Cost: ${intern_daily_cost:.2f}")

diff = rag_daily_cost - intern_daily_cost
if diff > 0:
    print(f"Fine-Tuned Model saves ${diff:.2f} per day (${diff*30:.2f}/month).")
else:
    print(f"RAG System saves ${abs(diff):.2f} per day (${abs(diff)*30:.2f}/month).")

--- ROI / Cost Analysis (Per Day) ---
The Librarian (RAG) Cost: $0.98
The Intern (Fine-Tuned) Cost: $0.70
Fine-Tuned Model saves $0.29 per day ($8.58/month).
