## Week 8 Lab Manual
### Foundations of Deep Learning & AI Functionality

**Instructor Note**: This lab manual provides the aim, code, and explanation for each practical task. Focus on the architectural patterns and the transition from theoretical concepts to functional AI implementations.

---

# Week 8: Model Evaluation & Monitoring

Welcome to Week 8. As we move towards production (Week 9), we must answer the most critical question: **"How do we know if our model or agent is actually good?"**

### Weekly Table of Contents
1. [Building an LLM-as-a-Judge Pipeline](#-Lab-8.1:-Building-an-LLM-as-a-Judge-Pipeline)
2. [Evaluating RAG with the "RAG Triad"](#-Lab-8.2:-Evaluating-RAG-with-the-"RAG-Triad")

Key topics:
- LLM-as-a-Judge Logic
- Automated Scoring with JSON Parsers
- The RAG Triad: Faithfulness, Relevance, and Context

###  Learning Objectives
1.  Understand the concept of **LLM-as-a-Judge**.
2.  Learn how to define evaluation metrics for RAG and Agents.
3.  Use Gemini 1.5 Flash to evaluate the outputs of our local Ollama models.
4.  Build an automated evaluation pipeline.

---
## 8.1 Why Evaluation is Hard

Unlike traditional software, LLM outputs are non-deterministic free text, so checking for "exact matches" is unreliable: a correct answer can be phrased in many different ways. Instead, we use:
*   **Benchmarks:** Static datasets (MMLU, GSM8K).
*   **Human-in-the-loop:** Expensive and slow.
*   **LLM Judges:** Using a superior model (Gemini 1.5 Flash) to grade a smaller model (Gemma).
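
To see why exact matching breaks down, consider the minimal sketch below. It compares a reference answer with a correct paraphrase and with a wrong answer, using strict equality and a crude token-overlap ratio. The helper names `normalize` and `token_overlap` and the example strings are illustrative assumptions, not part of the lab code.

```python
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def token_overlap(reference: str, candidate: str) -> float:
    """Fraction of unique reference tokens that also appear in the candidate."""
    ref_tokens = set(normalize(reference))
    cand_tokens = set(normalize(candidate))
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)

reference = "The capital of France is Paris."
paraphrase = "Paris is the capital of France."
wrong = "The capital of France is Lyon."

print(reference == paraphrase)               # strict equality rejects a correct paraphrase
print(token_overlap(reference, paraphrase))  # overlap accepts the paraphrase
print(token_overlap(reference, wrong))       # but also scores the wrong answer highly
```

Strict equality rejects the correct paraphrase, while lexical overlap still gives the wrong answer 5 of 6 matching tokens. Neither check captures meaning, which is the gap an LLM judge is meant to fill.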

---

In [None]:
# 📦 WEEK 8 INITIALIZATION
import os
from dotenv import load_dotenv
from IPython.display import Markdown, display

# LANGCHAIN EVALUATION
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# --- CONFIGURATION ---
load_dotenv(override=True)
MODEL = "gemini-1.5-flash"
LOCAL_MODEL = "gemma2:2b"

# The Judge (Cloud)
judge_llm = ChatGoogleGenerativeAI(model=MODEL, temperature=0)

# The Model being tested (Local or lower-capability)
test_llm = Ollama(model=LOCAL_MODEL)

print(f"✅ Week 8 Ready: Evaluation using {MODEL} as Judge.")


##  Lab 8.1: Building an LLM-as-a-Judge Pipeline
**Aim**: To establish an automated quality assurance workflow where a high-capability model (Gemini 1.5 Flash) assesses the performance of a local "student" model (Gemma 2).

**Explanation**:
This lab implements the "LLM-as-a-Judge" pattern:
1.  **Metric Definition**: We define a 1-5 accuracy scale that measures how faithfully the student response reflects the reference facts.
2.  **Structured Evaluation**: The prompt asks the judge to reply in JSON, and `JsonOutputParser` converts that reply into a dictionary holding both a quantitative score and a qualitative reason.
3.  **Benchmarking**: This allows developers to iterate on prompts or RAG settings and track numerical changes in model performance without manual review.
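
Because the JSON shape is only requested by the prompt and not guaranteed, it can help to validate the parsed result before trusting it. The following is a minimal sketch, assuming the parsed output is a plain `dict` with the `score` and `reasoning` keys used in this lab. `validate_judgment` is a hypothetical helper, not a LangChain API.

```python
def validate_judgment(result: dict, low: int = 1, high: int = 5) -> dict:
    """Check a parsed judge response and return a cleaned copy.

    Raises ValueError if the score is missing, non-numeric, or out of range.
    """
    if not isinstance(result, dict):
        raise ValueError(f"Expected a dict, got {type(result).__name__}")
    if "score" not in result:
        raise ValueError("Judge response has no 'score' field")
    try:
        score = int(result["score"])  # tolerate "4" as well as 4
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Score is not an integer: {result['score']!r}") from exc
    if not low <= score <= high:
        raise ValueError(f"Score {score} is outside {low}-{high}")
    reasoning = str(result.get("reasoning", "")).strip()
    return {"score": score, "reasoning": reasoning}

print(validate_judgment({"score": "4", "reasoning": " Matches the reference. "}))
```

In the lab pipeline, one option would be to wrap the call as `validate_judgment(eval_chain.invoke(...))` so that malformed judgments fail loudly instead of skewing the averages.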

In [None]:
from langchain_core.output_parsers import JsonOutputParser

eval_prompt = ChatPromptTemplate.from_template("""
You are an unbiased evaluator. Grade the 'Student Response' against the 'Reference Facts'.
Give a score from 1 to 5 (5 being perfectly accurate) and a brief reasoning.

Reference Facts: {reference}
Student Response: {response}

Return your answer in the following JSON format:
{{
    "score": int,
    "reasoning": "string"
}}
""")

eval_chain = eval_prompt | judge_llm | JsonOutputParser()

# Test Case
reference_fact = "The capital of France is Paris. It has the Eiffel Tower."
student_response = "Paris is the capital of France and is known for the Eiffel Tower."

print("Running LLM-as-a-Judge Evaluation...")
result = eval_chain.invoke({"reference": reference_fact, "response": student_response})
print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")
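
To turn single judgments into a benchmark number, the same chain can be run over a small test set and the scores summarized. The sketch below assumes a judge callable that takes a dict with `reference` and `response` keys and returns a dict with a `score` key; in this notebook that role would be played by `eval_chain.invoke`. A stub judge (`fake_judge`, hypothetical) stands in so the example runs without API access.

```python
from statistics import mean
from typing import Callable

def run_benchmark(cases: list[dict], judge: Callable[[dict], dict]) -> dict:
    """Score every case with the judge and summarize the results."""
    scores = [int(judge(case)["score"]) for case in cases]
    return {"n": len(scores), "mean": mean(scores), "min": min(scores)}

def fake_judge(case: dict) -> dict:
    """Stub judge: 5 if the response mentions 'Paris', else 1."""
    score = 5 if "Paris" in case["response"] else 1
    return {"score": score, "reasoning": "stub"}

cases = [
    {"reference": "The capital of France is Paris.", "response": "It is Paris."},
    {"reference": "The capital of France is Paris.", "response": "It is Lyon."},
]

summary = run_benchmark(cases, fake_judge)
print(summary)
```

With API access, passing `eval_chain.invoke` as `judge` would produce the same summary from real judgments. Reporting the minimum alongside the mean makes a single badly failing case visible.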

##  Lab 8.2: Evaluating RAG with the "RAG Triad"
**Aim**: To implement the widely used "RAG Triad" evaluation framework to identify specific points of failure in a retrieval-augmented generation system.

**Explanation**:
We break down RAG performance into three distinct components:
1.  **Context Relevance**: Checks whether the retriever found information relevant to the question.
2.  **Faithfulness**: Checks that the generator's answer is grounded in the retrieved context, with no hallucinated or outside knowledge.
3.  **Answer Relevance**: Verifies that the final output actually answers the user's specific question.

By measuring these separately, we can determine whether to fix the "Retrieval" (Vector DB) or the "Generation" (Prompting/Model).
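
The triage logic described above can be sketched as a small function that maps low scores to the component most likely at fault. This assumes all three metrics are on a 0 to 1 scale; the function name `diagnose_rag` and the 0.7 threshold are illustrative assumptions that would need tuning on real data.

```python
def diagnose_rag(context_relevance: float, faithfulness: float,
                 answer_relevance: float, threshold: float = 0.7) -> list[str]:
    """Map low triad scores (0-1 scale) to the component most likely at fault."""
    findings = []
    if context_relevance < threshold:
        findings.append("retrieval: retrieved context does not match the question")
    if faithfulness < threshold:
        findings.append("generation: answer is not grounded in the context")
    if answer_relevance < threshold:
        findings.append("generation: answer does not address the question")
    return findings or ["no component below threshold"]

print(diagnose_rag(context_relevance=0.3, faithfulness=0.9, answer_relevance=0.8))
```

Note that poor retrieval can also drag down answer relevance, so when several scores are low it is usually worth fixing retrieval first and re-measuring.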

In [None]:
import re

def calculate_relevance(question, context):
    """
    Measures Context Relevance (part of the RAG Triad).
    Returns a float in [0, 1], or 0.0 if the judge's reply contains no number.
    """
    prompt = f"""
    On a scale of 0 to 1, how relevant is the context below to the question?
    Return ONLY a numerical value.

    Question: {question}
    Context: {context}
    Relevance Score:"""

    response = judge_llm.invoke(prompt)
    # Extract the first number, in case the judge wraps it in extra text
    match = re.search(r"\d*\.?\d+", str(response.content))
    if match is None:
        return 0.0
    # Clamp to the requested 0-1 range
    return min(max(float(match.group()), 0.0), 1.0)

# Example Scenario
test_question = "What is the speed of light?"
test_context = "The speed of light in a vacuum is exactly 299,792,458 metres per second."

print("Testing RAG Triad - Context Relevance...")
score = calculate_relevance(test_question, test_context)
print(f"Context Relevance Score: {score}")
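
The other two legs of the triad can follow the same pattern. Below is a minimal sketch for faithfulness, assuming a judge callable that takes a prompt string and returns the reply text; in this notebook one could pass `lambda p: judge_llm.invoke(p).content`. The function name `calculate_faithfulness`, the prompt wording, and the stub judge are illustrative assumptions.

```python
import re
from typing import Callable

def calculate_faithfulness(answer: str, context: str,
                           judge: Callable[[str], str]) -> float:
    """Ask a judge how well the answer is supported by the context (0 to 1)."""
    prompt = (
        "On a scale of 0 to 1, how fully is the answer supported by the context? "
        "Return ONLY a numerical value.\n\n"
        f"Context: {context}\nAnswer: {answer}\nFaithfulness Score:"
    )
    reply = judge(prompt)
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.0  # unparseable reply: treat as a failed judgment
    return min(max(float(match.group()), 0.0), 1.0)  # clamp to [0, 1]

def stub_judge(prompt: str) -> str:
    """Stand-in judge so the sketch runs offline; always replies 0.9."""
    return "Score: 0.9"

print(calculate_faithfulness(
    "Light travels at 299,792,458 m/s in a vacuum.",
    "The speed of light in a vacuum is exactly 299,792,458 metres per second.",
    stub_judge,
))
```

Injecting the judge as a parameter keeps the scoring logic testable without network access; an answer-relevance function could reuse the same structure with the question and answer as inputs.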

In [None]:
# Final summary and completion

print("="*60)
print("WEEK 8 COMPILED NOTEBOOK - SETUP COMPLETE")
print("="*60)
print()
print("✅ Evaluation environment initialized using Gemini 1.5 Flash as Judge")
print("✅ Lab 8.1: LLM-as-a-Judge pipeline with JSON parsing ready")
print("✅ Lab 8.2: RAG Triad evaluation logic ready")
print()
print("🚀 Ready to scientifically measure model performance!")
print("📊 Use eval_chain.invoke() to test your student models.")
print("="*60)


---

##  Instructor's Evaluation & Lab Summary

###  Assessment Criteria
1. **Technical Implementation**: Adherence to the lab objectives and code functionality.
2. **Logic & Reasoning**: Clarity in the explanation of the underlying AI principles.
3. **Best Practices**: Use of secure environment variables and structured prompts.

**Lab Completion Status: Verified**
**Focus Area**: Language Modelling & Deep Learning Systems.