# Lab 4 – Evaluating RAG & Guardrails

In this lab you compute precision and recall of retrieved documents, guardrail false‑positive and false‑negative rates, and the latency of the MoE router.

## Environment Setup
Set `OPENAI_API_KEY` for metric computations that rely on LLM calls (RAGAS & LLM‑judge).
```bash
export OPENAI_API_KEY="sk-..."
```


In [None]:
!pip -q install ragas scikit-learn matplotlib tqdm openai guardrails-ai

import os, openai, warnings
if not os.getenv('OPENAI_API_KEY'):
    raise ValueError('OPENAI_API_KEY missing; metrics requiring LLM will fail.')
openai.api_key = os.environ['OPENAI_API_KEY']
warnings.filterwarnings('ignore')

In [None]:
!pip -q install ragas --upgrade

In [None]:
# Example evaluation of RAG with RAGAS (placeholder – requires data)
from ragas import evaluate
# Suppose we have a dataset of (question, ground_truth_answer)
# ragas needs predictions + retrieved docs; refer to docs for full pipeline.

In [None]:
# Timing router vs single model
# Assumes `chain` (the router chain) and `query` are already defined in this session.
import time
start = time.perf_counter()
_ = chain.invoke({'query': query})
print(f'Router latency: {time.perf_counter() - start:.3f} s')

### ✏️ Exercises
1. Build a small dataset of 10 Q/A pairs and evaluate your RAG.
2. Compute guardrail block rate on a provided list of toxic prompts.

## 1. Prepare Evaluation Dataset
Use the RAG chain from Lab 1 to answer a small set of questions and store references.

In [None]:
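qa_chain = None  # TODO: load from Lab1 or rebuild quickly
if qa_chain is None:
    raise RuntimeError('Set qa_chain to the Lab 1 RAG chain before running this cell.')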
questions = [
    'What is the tallest mountain in Africa?',
    'Who wrote the novel 1984?',
    'When was the Eiffel Tower built?',
    'Define photosynthesis.',
    'What currency is used in Japan?'
]
preds, contexts, gts = [], [], []
for q in questions:
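    ans = qa_chain.run(q)
    preds.append(ans)
    # Assumes the Lab 1 chain records its retrieved documents on this attribute
    contexts.append(qa_chain._last_retrieved_docs)
    gts.append('')  # TODO: replace with the reference answer (manually or from a file)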
print('Collected', len(preds), 'QA pairs')

## 2. RAG Quality Metrics with RAGAS

In [None]:
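from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness

# ragas evaluates a Hugging Face Dataset rather than separate keyword lists.
# Column names below follow ragas 0.1.x; other releases differ, so check the docs for your version.
eval_ds = Dataset.from_dict({
    'question': questions,
    'answer': preds,
    'contexts': [[d.page_content for d in ctx] for ctx in contexts],
    'ground_truth': gts,  # context_precision and context_recall need non-empty references
})
result = evaluate(eval_ds, metrics=[context_precision, context_recall, faithfulness])
df = result.to_pandas()
df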
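
The RAGAS scores above are judged by an LLM. As a cheaper cross-check, set-based precision and recall can be computed directly once you have labeled which documents are relevant to each question. The following is a minimal sketch under that assumption; `retrieval_precision_recall` and the document IDs are hypothetical and not part of RAGAS or the Lab 1 chain.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query, treating both inputs as sets of doc IDs."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs retrieved, 2 of the 3 relevant ones among them.
p, r = retrieval_precision_recall(['d1', 'd2', 'd3', 'd4'], ['d2', 'd4', 'd9'])
print(f'precision={p:.2f} recall={r:.2f}')  # precision=0.50 recall=0.67
```

Averaging these per-question values over the evaluation set gives a single retrieval score that does not depend on any LLM call.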

## 3. Guardrail False‑Positive / False‑Negative Matrix

In [None]:
from sklearn.metrics import confusion_matrix
toxic_samples = ['You are stupid!', 'Great work!']
labels = [1,0]  # 1=toxic
preds_guard = []
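# Assumes `guard_in` is already defined and raises an exception when it blocks an input.
for s in toxic_samples:
    try:
        guard_in.validate(s)
        preds_guard.append(0)  # passed the guard
    except Exception:  # guard raised, i.e. the input was blocked
        preds_guard.append(1)
# Rows are true labels, columns are predictions, both ordered [0, 1]
cm = confusion_matrix(labels, preds_guard, labels=[0, 1])
cm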
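
The confusion matrix holds raw counts; the false-positive rate (benign prompts blocked) and false-negative rate (toxic prompts allowed) follow from them. Here is a small sketch in plain Python, assuming the same encoding as above (1 = toxic or blocked, 0 = benign or allowed). The helper `guardrail_rates` and the example lists are hypothetical.

```python
def guardrail_rates(labels, preds):
    """Return (false_positive_rate, false_negative_rate) for binary labels and predictions."""
    pairs = list(zip(labels, preds))
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    # FPR = FP / (FP + TN); FNR = FN / (FN + TP). Return 0.0 when a class is absent.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

# Hypothetical labels and guard decisions for six prompts.
example_labels = [1, 1, 1, 0, 0, 0]
example_preds = [1, 0, 1, 0, 1, 0]
fpr, fnr = guardrail_rates(example_labels, example_preds)
print(f'FPR={fpr:.2f} FNR={fnr:.2f}')  # FPR=0.33 FNR=0.33
```

With only two samples, as in the cell above, these rates are too coarse to be meaningful; a larger labeled prompt list is needed.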

## 4. Latency & Cost Analysis for MoE Router

In [None]:
import time, statistics, matplotlib.pyplot as plt
qset = questions*3
t_dense, t_router = [], []
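# `dense_call` and `router_call` must already be defined; this notebook does not create them.
for q in qset:
    start = time.perf_counter()
    _ = dense_call(q)
    t_dense.append(time.perf_counter() - start)
    start = time.perf_counter()
    _ = router_call(q)
    t_router.append(time.perf_counter() - start)
print(f'Median dense {statistics.median(t_dense):.3f} s, router {statistics.median(t_router):.3f} s')
plt.boxplot([t_dense, t_router])
plt.xticks([1, 2], ['Dense', 'Router'])  # `labels=` in boxplot is deprecated since Matplotlib 3.9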
plt.ylabel('seconds')
plt.title('Latency Comparison')
plt.show()
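
A median alone hides tail latency, and the first call often pays one-off costs such as connection setup. A possible refinement is sketched below: untimed warm-up calls followed by a median and an approximate 95th percentile. The helpers `time_calls` and `summarize` are hypothetical, and `fake_model` is only a stand-in so the sketch runs without the real `dense_call` or `router_call`.

```python
import statistics
import time

def time_calls(fn, inputs, warmup=1):
    """Call fn on each input and return the per-call latencies in seconds."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Return the median and an approximate 95th-percentile latency."""
    if len(latencies) >= 2:
        # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
        p95 = statistics.quantiles(latencies, n=20)[-1]
    else:
        p95 = latencies[0]
    return {'median': statistics.median(latencies), 'p95': p95}

def fake_model(q):
    """Stand-in for a model call (assumption, for illustration only)."""
    time.sleep(0.001)
    return q.upper()

stats = summarize(time_calls(fake_model, ['a', 'b', 'c', 'd', 'e']))
print(stats)
```

To use it on the lab's models, pass `dense_call` or `router_call` in place of `fake_model` and `qset` as the inputs.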

## ✏️ Exercises (Lab 4)
1. Expand the evaluation dataset to 20 Q/A pairs; compute precision & recall.
2. Implement a grounding checker that counts sentences not present in context.
3. Plot confusion matrix for guardrail results using `seaborn.heatmap`.
4. Evaluate token costs for dense vs router calls using `tiktoken`.
5. **Stretch**: Compare faithfulness metric before and after increasing `TOP_K` in retrieval.
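
As a starting point for exercise 2, the sketch below counts answer sentences that do not appear verbatim in the retrieved context. It assumes exact, case-insensitive substring matching, which is a deliberately crude notion of grounding, and the helper `ungrounded_sentences` and the example strings are hypothetical.

```python
import re

def ungrounded_sentences(answer, contexts):
    """Return the answer sentences that do not occur verbatim (ignoring case) in the contexts."""
    def norm(text):
        # Collapse whitespace and lowercase so formatting differences do not matter.
        return re.sub(r'\s+', ' ', text).strip().lower()

    haystack = ' '.join(norm(c) for c in contexts)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', answer.strip()) if s]
    # Drop trailing punctuation before matching so a sentence can match mid-context.
    return [s for s in sentences if norm(s).rstrip('.!?') not in haystack]

answer = 'Kilimanjaro is in Tanzania. It is 9000 m tall.'
contexts = ['Mount Kilimanjaro is in Tanzania and rises 5,895 m above sea level.']
missing = ungrounded_sentences(answer, contexts)
print(len(missing), missing)  # 1 ['It is 9000 m tall.']
```

Paraphrased but correct sentences will be flagged by this check, so treat the count as an upper bound and compare it against the RAGAS faithfulness score.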