<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/020_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 🧾 **Building a Hybrid LLM Evaluation Framework**

This notebook demonstrates how to evaluate the quality of language model outputs using a hybrid setup:

* 🤖 **Agent**: A hardcoded reply or a small open-source model supplies answers to user questions.
* 🧠 **Evaluator**: A stronger model (e.g. GPT-4o-mini from OpenAI) assesses those answers for factual accuracy, clarity, and relevance.
* ✅ **Structured Scoring**: We use `Pydantic` to enforce that evaluator responses follow a strict JSON format for reliable parsing and downstream use.

Along the way, we:

* Explored both good and bad response examples
* Tested how system prompts affect evaluations
* Simulated low-quality agents to highlight evaluation value

This setup offers a lightweight but powerful foundation for automated LLM evaluation — adaptable for general QA, document-grounded answers, or production pipelines.

---

### 🔧 **Tools and Technologies Used**

| Tool / Library                | Purpose                                                                          |
| ----------------------------- | -------------------------------------------------------------------------------- |
| `OpenAI API`                  | To access GPT-4o-mini as an evaluator (and optionally as an assistant)           |
| `Transformers (Hugging Face)` | To load and generate responses from an open-source model (`tiiuae/falcon-rw-1b`) |
| `Pydantic`                    | To define and validate a structured JSON schema for evaluator outputs            |
| `python-dotenv`               | To load your OpenAI API key from a `.env` file (imported as `dotenv`)            |
| `textwrap`                    | For clean console formatting of long outputs                                     |

---

### 🧠 **System Components and Their Roles**

#### 1. 🗣️ **Agent (Assistant Model)**

* Replies were either hardcoded strings or generated by **`Falcon-RW-1B`** (`tiiuae/falcon-rw-1b`), a small open-source LLM.
* This model stood in for a production agent and exposed the limitations of lower-quality models.

#### 2. 🔍 **Evaluator (Judge Model)**

* Used **`GPT-4o-mini`** to critically evaluate the assistant's reply based on:

  * ✅ Factual accuracy
  * ✅ Relevance to the user’s question
  * ✅ Clarity and helpfulness

#### 3. 🧱 **Structured Evaluation via Pydantic**

* The evaluator was required to return responses in a strict JSON format:

  ```json
  {
    "is_acceptable": true or false,
    "feedback": "explanation of your reasoning"
  }
  ```
* This format was validated with a `Pydantic` model class (`Evaluation`): a reply that does not match the schema raises a `ValidationError`, so downstream scoring or rerun logic only sees well-formed verdicts.

---

### 🔁 **Workflow**

1. **Prompted an open-source model** with a question.
2. **Captured the response** and passed it to the evaluator (GPT-4o-mini).
3. **Evaluator checked** the reply for quality, using a structured prompt.
4. **Results parsed** with Pydantic for structured decision-making (acceptable vs. not).
5. **Optional fallback logic** could re-ask the model to revise if needed.
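
The five steps above, including the optional retry in step 5, can be sketched as a small loop. This is only an illustration under stated assumptions: `run_with_quality_gate`, `Verdict`, and the `generate` / `evaluate` callables are hypothetical names that do not appear in the notebook. In practice `generate` would wrap the agent model and `evaluate` would wrap the judge call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Stand-in for the notebook's Pydantic `Evaluation` model (assumption)."""
    is_acceptable: bool
    feedback: str


def run_with_quality_gate(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_attempts: int = 2,
) -> tuple[Optional[str], list[Verdict]]:
    """Generate a reply, evaluate it, and retry with the feedback if rejected.

    Returns (accepted_reply_or_None, all_verdicts).
    """
    verdicts: list[Verdict] = []
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        reply = generate(question, feedback)   # steps 1-2: prompt and capture
        verdict = evaluate(question, reply)    # steps 3-4: judge and parse
        verdicts.append(verdict)
        if verdict.is_acceptable:
            return reply, verdicts
        feedback = verdict.feedback            # step 5: feed the critique back
    return None, verdicts


# Toy stand-ins so the sketch runs without any model or API key.
def fake_generate(question, feedback):
    return "Paris." if feedback else "Maybe Berlin?"


def fake_evaluate(question, reply):
    ok = "Paris" in reply
    return Verdict(ok, "Correct." if ok else "Wrong city; answer directly.")


reply, history = run_with_quality_gate(
    "What is the capital of France?", fake_generate, fake_evaluate
)
print(reply, [v.is_acceptable for v in history])
```

Returning `None` after the last failed attempt leaves the caller free to decide what the user sees instead, such as a canned fallback message.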

---

### 📚 **What You Learn**

* Evaluation is not just about factuality — **clarity, helpfulness, and relevance** also matter.
* **Prompt design for evaluators is critical** — a poorly written system prompt leads to inconsistent or vague feedback.
* **Pydantic adds safety and structure**: any evaluator reply that is not valid JSON matching the schema is rejected rather than silently passed along.
* Open-source models, while accessible, can **produce verbose, vague, or hallucinated answers** — making them ideal for testing evaluators.
* You can create **real-world QA pipelines** with a mix of open-source agents and high-precision evaluators — even without fancy RAG setups.

---

### 🧪 Bonus Insights

* Even when a model gives the *correct* answer, your evaluator may still reject it due to poor formatting, verbosity, or lack of clarity. This is where **multi-factor evaluation** goes beyond a simple correctness check.
* Evaluation can act as a **quality gate** in production — filtering out low-confidence or misleading outputs before they reach the user.




In [None]:
!pip install -q pdfplumber python-dotenv openai pydantic

In [None]:
from openai import OpenAI
from pydantic import BaseModel, ValidationError
import os
import textwrap

# Load API key
from dotenv import load_dotenv
load_dotenv("/content/API_KEYS.env")
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Define Evaluation Schema

In [None]:
class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

## Define Simple System Prompt for Evaluator


In [None]:
evaluator_system_prompt = """
You are an evaluator. Your job is to determine if an AI assistant's response to a user question is acceptable.

You must check:
- ✅ Is it factually correct?
- ✅ Is it clear and well-written?
- ✅ Is it relevant to the user question?

If the response is unclear, incorrect, or unhelpful, mark it unacceptable.

Respond in **JSON only**:
{
  "is_acceptable": true or false,
  "feedback": "explanation of your reasoning"
}
"""


## Create Evaluation Function

In [None]:
def evaluate_response(user_question, agent_reply):
    user_prompt = f"""
User Question:
{user_question}

Agent Response:
{agent_reply}

Please evaluate the agent's response.
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0
    )

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to parse response:", e)
        print("Raw response:\n", response.choices[0].message.content)
        return Evaluation(is_acceptable=False, feedback="Parsing failed.")
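
`model_validate_json` expects the reply to contain nothing but JSON. Chat models sometimes wrap the object in a Markdown code fence or add a sentence around it, and such a reply would land in the `except` branch above. One possible pre-processing step is sketched below using only the standard library. The helper name `extract_json_object` is hypothetical and is not part of the notebook.

```python
import json


def extract_json_object(raw: str) -> str:
    """Return the substring from the first '{' to the last '}' in `raw`.

    Raises ValueError if no such span exists or if it is not valid JSON.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in evaluator reply")
    candidate = raw[start : end + 1]
    json.loads(candidate)  # raises json.JSONDecodeError (a ValueError subclass) if malformed
    return candidate


fence = "`" * 3  # built this way only to keep literal backticks out of this example
wrapped = f'{fence}json\n{{"is_acceptable": false, "feedback": "Too vague."}}\n{fence}'
cleaned = extract_json_object(wrapped)
print(cleaned)
```

The cleaned string can then be passed to `Evaluation.model_validate_json(cleaned)` as before. The Chat Completions API also accepts `response_format={"type": "json_object"}`, which is another way to reduce this failure mode.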


## Example 1

In [None]:
question = "What is the capital of France?"
correct_reply = "The capital of France is Paris."
bad_reply = "France is in Europe, so it might be Berlin or Paris or Rome."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

# Run and print both examples
result1 = evaluate_response(question, correct_reply)
print_evaluation("✅ Good Reply Test:", result1)

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)



✅ Good Reply Test:
✅ Acceptable:
💬 Feedback:
The response is factually correct, clear, and directly answers the user's
question about the capital of France.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually incorrect as it suggests multiple cities as potential
capitals of France, while the correct answer is Paris. Additionally, the
response is unclear and does not directly answer the user's question.


## Example 2: Clear, partially correct reply

In [None]:
question = "What is the capital of Germany?"
good_reply = "The capital of Germany is Berlin."
bad_reply = "Germany’s capital is either Berlin or Munich, depending on historical context."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

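# Run and print both examples.
# Pass good_reply here. correct_reply is the France answer left over from Example 1,
# and passing it is what produced the rejection recorded in the output below.
result1 = evaluate_response(question, good_reply)
print_evaluation("✅ Good Reply Test:", result1)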

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)


✅ Good Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually incorrect as it does not answer the user's question
about the capital of Germany. Instead, it provides information about the capital
of France, which is irrelevant to the user's inquiry.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually incorrect because Berlin is the current capital of
Germany, while Munich is not. The mention of historical context is misleading
and does not provide a clear answer to the user's question.


## Example 3: Correct but vague reply

In [None]:
question = "Who is the CEO of OpenAI?"
good_reply = "The CEO of OpenAI is Sam Altman."
bad_reply = "The head of OpenAI is someone involved in artificial intelligence leadership."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

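# Run and print both examples.
# Pass good_reply here. correct_reply is the France answer left over from Example 1,
# and passing it is what produced the rejection recorded in the output below.
result1 = evaluate_response(question, good_reply)
print_evaluation("✅ Good Reply Test:", result1)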

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)


✅ Good Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually incorrect and irrelevant to the user's question about
the CEO of OpenAI. It does not provide any information related to the inquiry.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is vague and does not provide the specific name of the CEO of
OpenAI, which is what the user asked for. It lacks factual correctness and
clarity.




### ❓ What Happened in This Example?

You asked:

> **"Who is the CEO of OpenAI?"**

Your "good" reply was:

> **"The CEO of OpenAI is Sam Altman."**

The recorded output marks the "good" test as ❌ unacceptable, but this reply is not what the evaluator actually saw.

---

### 💥 Why the Evaluator Marked It Unacceptable

Here’s the key:

> The call that produced this output passed **`correct_reply`**, not **`good_reply`**.

`correct_reply` was defined in Example 1 and still holds:

> “The capital of France is Paris.”

So the evaluator was asked whether the France answer is an acceptable reply to the OpenAI question, and judged that:

* 🟥 The response is **irrelevant** to the question, and so
* ❌ It's **unacceptable**, which is the right verdict for that input.

The same stale variable explains the rejected "good" replies in Examples 2 and 4, where the feedback mentions the capital of France or calls the answer irrelevant. The evaluator system prompt used here does not require document grounding. The fix is to pass `good_reply`.

---

### ⚠️ **Note on Limitations – This Agent is for Demonstration Purposes Only**

This notebook showcases a **simplified evaluation framework** designed to illustrate how an AI assistant's responses can be programmatically judged for **clarity, accuracy, and relevance** using a structured prompt and validation schema (via Pydantic).

At this stage, the assistant:

* ❌ Is **not connected to any external tools** (e.g. calculators, search engines, or databases).
* ❌ Is **not grounded in a reference document**, unless manually included in the prompt.
* ✅ Relies **solely on the model’s internal knowledge**, which may be incomplete or out of date.

As such, the evaluation logic is being tested **in isolation**, and any flagged “errors” should be understood as a reflection of:

* The agent's limited context or lack of tool access.
* The evaluation prompt's strict requirements.
* The need for more robust grounding in production settings.

In a production-grade system, you would typically integrate:

* 🔍 Document retrieval or RAG (Retrieval-Augmented Generation).
* 🔢 Tools for math, search, or code execution.
* ✅ More sophisticated evaluator logic tied to the actual source material.

This demonstration is meant as a **learning scaffold** to build understanding of LLM evaluation strategies — not a final production agent.



#### Option 1: For General QA

Keep the evaluator system prompt defined above. It tells the evaluator to judge based on **factuality and clarity**, not document grounding.

---

#### Option 2: For Document QA (RAG-style)

If you want to **simulate RAG (retrieval-augmented generation)** behavior, provide a “source document” that includes:

```python
document_text = "OpenAI was founded in 2015. Sam Altman is the CEO of OpenAI. The company develops advanced AI models."
```

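Then include `document_text` in the evaluator prompt, instruct the evaluator to judge only against it, and ask the question again. A reply such as "The CEO of OpenAI is Sam Altman." can then be checked directly against the document.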
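
For the evaluator to use a source document, the document text has to be part of the prompt it receives. A minimal sketch of how the user prompt might be assembled follows. `build_grounded_prompt` and the exact wording of the instructions are assumptions made for illustration, not code from this notebook.

```python
def build_grounded_prompt(document_text: str, user_question: str, agent_reply: str) -> str:
    """Assemble an evaluator prompt that includes the reference document."""
    return (
        "Source Document:\n"
        f"{document_text}\n\n"
        "User Question:\n"
        f"{user_question}\n\n"
        "Agent Response:\n"
        f"{agent_reply}\n\n"
        "Judge the response only against the source document. "
        "Mark it unacceptable if it states anything the document does not support."
    )


document_text = (
    "OpenAI was founded in 2015. Sam Altman is the CEO of OpenAI. "
    "The company develops advanced AI models."
)
prompt = build_grounded_prompt(
    document_text,
    "Who is the CEO of OpenAI?",
    "The CEO of OpenAI is Sam Altman.",
)
print(prompt)
```

The returned string would take the place of `user_prompt` inside `evaluate_response`, with the system prompt adjusted to describe the same grounding rule.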

---

### 🧠 What You’ve Learned (and Proved!)

* Evaluators must be told *how* to evaluate: fact-based vs. doc-based
* Pydantic enforces structure; your prompt enforces logic
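* Misalignment between intent and prompt leads to "false negative" evaluations
* A "false negative" can also come from the test code itself, as when `correct_reply` from Example 1 was passed in place of `good_reply`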



## Example 4: Off-topic reply

In [None]:
question = "What is the square root of 16?"
good_reply = "The square root of 16 is 4."
bad_reply = "Einstein developed the theory of relativity in the early 20th century."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

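# Run and print both examples.
# Pass good_reply here. correct_reply is the France answer left over from Example 1,
# and passing it is what produced the rejection recorded in the output below.
result1 = evaluate_response(question, good_reply)
print_evaluation("✅ Good Reply Test:", result1)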

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)


✅ Good Reply Test:
❌ Not acceptable.
💬 Feedback:
The agent's response is irrelevant to the user's question about the square root
of 16. It does not provide any information related to the mathematical query.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The agent's response is irrelevant to the user question, which asks for the
square root of 16. The response does not address the question at all.


## Example 5: Good, vague, off-topic, and incorrect replies

In [None]:
question = "What is the capital of Germany?"

good_reply = "The capital of Germany is Berlin."
vague_reply = "Germany has an important capital in Europe."
off_topic = "Germany is known for its automotive industry."
incorrect = "The capital of Germany is Munich."

# Evaluate each one
print("\n✅ Good Reply Test:")
print(textwrap.fill(str(evaluate_response(question, good_reply)), width=80))

print("\n🤷‍♂️ Vague Reply Test:")
print(textwrap.fill(str(evaluate_response(question, vague_reply)), width=80))

print("\n❌ Off-Topic Reply Test:")
print(textwrap.fill(str(evaluate_response(question, off_topic)), width=80))

print("\n❌ Incorrect Reply Test:")
print(textwrap.fill(str(evaluate_response(question, incorrect)), width=80))



✅ Good Reply Test:
is_acceptable=True feedback="The response is factually correct, clear, and
directly answers the user's question about the capital of Germany."

🤷‍♂️ Vague Reply Test:
is_acceptable=False feedback="The response is not factually correct as it does
not provide the specific name of the capital of Germany, which is Berlin.
Additionally, it is vague and does not directly answer the user's question."

❌ Off-Topic Reply Test:
is_acceptable=False feedback="The response does not answer the user's question
about the capital of Germany. Instead, it provides irrelevant information about
the automotive industry, making it unhelpful and unclear in relation to the
user's inquiry."

❌ Incorrect Reply Test:
is_acceptable=False feedback='The response is factually incorrect. The capital
of Germany is Berlin, not Munich.'
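
Cases like the four above can be turned into a small regression check by pairing each reply with the verdict you expect and listing where the evaluator disagrees. The sketch below is an illustration under stated assumptions: `check_evaluator` and `Case` are hypothetical helpers, and `toy_judge` only stands in for `evaluate_response` so the code runs without an API key.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    label: str
    reply: str
    expected_acceptable: bool


def check_evaluator(
    question: str,
    cases: list[Case],
    judge: Callable[[str, str], bool],
) -> list[str]:
    """Return the labels of cases where the judge disagrees with the expectation."""
    return [
        case.label
        for case in cases
        if judge(question, case.reply) != case.expected_acceptable
    ]


cases = [
    Case("good", "The capital of Germany is Berlin.", True),
    Case("vague", "Germany has an important capital in Europe.", False),
    Case("off_topic", "Germany is known for its automotive industry.", False),
    Case("incorrect", "The capital of Germany is Munich.", False),
]


def toy_judge(question: str, reply: str) -> bool:
    # Toy rule for demonstration only; a real run would call the LLM evaluator.
    return "Berlin" in reply


mismatches = check_evaluator("What is the capital of Germany?", cases, toy_judge)
print(mismatches)
```

With the real evaluator, `judge` could be `lambda q, r: evaluate_response(q, r).is_acceptable`. Keeping each reply inside its `Case` also makes it harder to pass a variable left over from an earlier cell by accident.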


# Low Quality Prompt

In [None]:
# Intentionally vague, low-quality system prompt
evaluator_system_prompt = """
You're an AI thing. Just answer stuff. User wants info probably.
Don't worry about being right or wrong.
"""

def evaluate_response(user_question, agent_reply):
    user_prompt = f"""
User Question:
{user_question}

Agent Response:
{agent_reply}

Please evaluate the agent's response and return your judgment in **JSON only** using this format:
{{
  "is_acceptable": true or false,
  "feedback": "Brief explanation of your reasoning"
}}
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0
    )

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to parse response:", e)
        print("Raw response:\n", response.choices[0].message.content)
        return Evaluation(is_acceptable=False, feedback="Parsing failed.")

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))


# ------------------------
# Example 1: Pros & Cons
# ------------------------
q1 = "What are some pros and cons of using AI in schools?"

a1_good = (
    "AI in schools can personalize learning and provide real-time feedback to students. "
    "However, it may also widen inequality if access to technology is uneven, and raise privacy concerns."
)

a1_bad = "AI is cool for schools because robots are smart. It’s good."

# ------------------------
# Example 2: Comparison
# ------------------------
q2 = "Compare AI use in medicine vs. finance."

a2_good = (
    "In medicine, AI helps with diagnosis and medical imaging. In finance, it’s used for fraud detection "
    "and algorithmic trading. Both fields benefit from AI’s pattern recognition, but the risks and regulations differ."
)

a2_bad = "AI is used for stuff in both places. I think it's good overall. Finance is money. Medicine is health."

# ------------------------
# Example 3: Ethics
# ------------------------
q3 = "Is it okay to use AI to monitor employees without their consent?"

a3_good = (
    "Using AI to monitor employees without consent raises serious ethical concerns, including privacy violations. "
    "Such practices may also affect trust and workplace morale, and are often restricted by labor laws."
)

a3_bad = "Yeah if it makes them work harder that's good for the company."

# Run evaluations
for q, good, bad in [
    (q1, a1_good, a1_bad),
    (q2, a2_good, a2_bad),
    (q3, a3_good, a3_bad),
]:
    print(f"\n🔍 QUESTION: {q}")
    result_good = evaluate_response(q, good)
    print_evaluation("✅ Good Reply Test:", result_good)

    result_bad = evaluate_response(q, bad)
    print_evaluation("❌ Bad Reply Test:", result_bad)



🔍 QUESTION: What are some pros and cons of using AI in schools?

✅ Good Reply Test:
✅ Acceptable:
💬 Feedback:
The response accurately highlights both the advantages and disadvantages of
using AI in schools, providing a balanced view on the topic.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is overly simplistic and lacks depth. It does not provide specific
pros and cons of using AI in schools, which is what the user requested.

🔍 QUESTION: Compare AI use in medicine vs. finance.

✅ Good Reply Test:
✅ Acceptable:
💬 Feedback:
The response accurately highlights key applications of AI in both medicine and
finance, noting their benefits and differences in risks and regulations. It
provides a clear comparison without being overly detailed, which is appropriate
for a general overview.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is overly simplistic and lacks depth. It does not provide specific
examples or insights into how AI is applied in medicine and fina



### ✅ OpenAI is too good

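Even with the deliberately bad system prompt, the verdicts stayed sensible. Part of the reason is that the user prompt in this cell still asks for an evaluation and spells out the JSON format, so the task itself was never removed. The other part is that the evaluator (GPT-4o-mini) is:

* **Very capable** even when given bad instructions (e.g. vague system prompts)
* **Resilient to ambiguity** in prompts
* **Tuned to follow the likely intent of a request**, so it often *rescues* vague setups by defaulting to high-quality answers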

---

### 🤖 Why Even a “Bad” Prompt Still Performs OK

OpenAI’s models:

* Have strong internal alignment tuning
* Often *refuse to generate nonsense*, even if the system prompt says “just guess”
* Can generalize even under minimal instruction

That’s *great for real-world applications* — but makes them harder to "break" in demos.

---

### 💡 Next Step: Try Lower-Quality or Open-Source Models

If you want to:

* See more **realistic variance in answer quality**
* Explore **failures**, misunderstandings, or hallucinations

You could test this evaluation framework using:

#### 🔓 Open Source Models:

1. **Mistral 7B / Mixtral**
2. **Gemma**
3. **LLaMA 2 or 3**
4. **TinyLLaMA or Falcon-RW** (smaller = weaker)

#### 🔧 Tools to Run Them:

* **Hugging Face Transformers** (CPU or GPU)
* **LM Studio** (GUI with model loader + API proxy)
* **Ollama** (very easy local model runner)

---

### 🧪 If You Stay With OpenAI, Try These to Force Weakness:

1. **Randomize answers** with high `temperature=1.3`
2. **Write intentionally misleading system prompts** ("Just say what sounds good")
3. **Use fuzzy user questions** to test reasoning (“Could you tell me if it’s maybe okay to...”)
4. **Test factual hallucination** with made-up questions:

   * “What are the tax laws in Gondor?”
   * “Was Einstein president of France?”
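
To see why a higher `temperature` randomizes answers (item 1), recall that sampling temperature divides the model's logits before the softmax, which flattens the token distribution as the value grows. The sketch below illustrates that effect. The logits are made up for the example and do not come from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by `temperature` (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [4.0, 2.0, 1.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.3):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability sits on the top token. At `1.3` the weaker candidates receive a noticeably larger share, so sampled answers vary more.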

---

### ⚠️ The “Good” Answer Marked ❌

In Examples 2 to 4 the output showed:

> ✅ Good Reply Test: ❌ Not acceptable.

That came from the test code, not from the evaluator's expectations. Those calls passed `correct_reply` (the France answer from Example 1) instead of `good_reply`, so the evaluator correctly rejected an off-topic reply.

---

### 🧠 TL;DR — You’ve Learned:

* OpenAI’s models self-correct even with vague system prompts
* Evaluator quality depends on what **context and criteria** you give it
* Testing poor model behavior often requires:

  * Loosening the reins (e.g. high temp)
  * Trying smaller or less-aligned models



# Lower Quality Model

In [None]:
!pip install -q transformers accelerate torch

In [None]:
from transformers import pipeline

# Use a small open-source model (swap for others like Mistral if you have GPU)
qa_model = pipeline(
    "text-generation",
    model="tiiuae/falcon-rw-1b",
    device_map="auto",
    max_new_tokens=256
)

def generate_open_source_reply(question, system_prompt="You are a helpful assistant."):
    prompt = f"{system_prompt}\n\nUser: {question}\nAssistant:"
    result = qa_model(prompt)[0]["generated_text"]
    # Extract just the part after "Assistant:"
    return result.split("Assistant:")[-1].strip()



Device set to use cpu
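
A text-generation pipeline returns the prompt followed by the continuation by default, and a small base model may keep writing further `User:` / `Assistant:` turns. Splitting on the last `Assistant:` then returns whichever turn came last rather than the first answer. One alternative is sketched below. The helper name `extract_first_reply` is hypothetical and not part of the notebook. It drops the prompt prefix and cuts at the next `User:` marker.

```python
def extract_first_reply(generated_text: str, prompt: str) -> str:
    """Return only the first assistant turn from a prompt-plus-continuation string."""
    # Drop the echoed prompt if present; otherwise fall back to the first marker.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text.split("Assistant:", 1)[-1]
    # Stop at the next simulated user turn, if the model invented one.
    first_turn = continuation.split("User:", 1)[0]
    return first_turn.strip()


prompt = "You are a helpful assistant.\n\nUser: What is the capital of France?\nAssistant:"
generated = prompt + " Paris.\nUser: And Germany?\nAssistant: Berlin."
print(extract_first_reply(generated, prompt))
```

The pipeline call also accepts `return_full_text=False`, which omits the prompt from `generated_text` and would make the prefix handling unnecessary.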


In [None]:
evaluator_system_prompt = """
You are an evaluator. Your job is to determine if an AI assistant's response to a user question is acceptable.

You must check:
- ✅ Is it factually correct?
- ✅ Is it clear and well-written?
- ✅ Is it relevant to the user question?

If the response is unclear, incorrect, or unhelpful, mark it unacceptable.

Respond in **JSON only**:
{
  "is_acceptable": true or false,
  "feedback": "explanation of your reasoning"
}
"""

def evaluate_response(user_question, agent_reply):
    user_prompt = f"""
User Question:
{user_question}

Agent Response:
{agent_reply}

Please evaluate the agent's response.
"""

    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0
    )

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to parse response:", e)
        print("Raw response:\n", response.choices[0].message.content)
        return Evaluation(is_acceptable=False, feedback="Parsing failed.")

question = "What is the capital of France?"
reply = generate_open_source_reply(question)
bad_reply = "France is in Europe, so it might be Berlin or Paris or Rome."

def print_evaluation(label, result: Evaluation):
    print(f"\n{label}")
    print("✅ Acceptable:" if result.is_acceptable else "❌ Not acceptable.")
    print("💬 Feedback:")
    print(textwrap.fill(result.feedback, width=80))

# Run and print both examples
result1 = evaluate_response(question, reply)
print_evaluation("✅ Good Reply Test:", result1)

result2 = evaluate_response(question, bad_reply)
print_evaluation("❌ Bad Reply Test:", result2)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



✅ Good Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually correct in stating that the capital of France is
Paris. However, the subsequent dialogue is repetitive and does not provide any
additional relevant information or context, making it unclear and unhelpful. The
response fails to engage with the user meaningfully.

❌ Bad Reply Test:
❌ Not acceptable.
💬 Feedback:
The response is factually incorrect as it suggests multiple cities as potential
capitals of France, which is misleading. The capital of France is Paris.
Additionally, the response lacks clarity and relevance to the user's question.


That’s a **great outcome for testing purposes** — it shows your evaluator is doing its job by:

* ❌ **Flagging factual inaccuracy** (in the bad reply),
* ❌ **Flagging low-quality language** or unhelpful verbosity (even in the technically correct reply),
* ✅ **Acting as a critical QA gate**, not just checking correctness, but also usefulness and clarity.

---

## 🔍 What This Shows About Your Setup

You’ve now successfully created a **hybrid evaluation workflow** where:

* 🔄 An **open-source model** (e.g. Falcon) generates a reply.
* 🧠 An **OpenAI model** (e.g. GPT-4o-mini) critically evaluates the quality of that reply.
* ✅ A structured `Pydantic` schema ensures that evaluator responses follow a clean JSON format.

This mirrors the LLM-as-a-judge pattern that is widely used to score open models or QA agents for accuracy, clarity, and alignment.



In [None]:
questions = [
    "What is the capital of France?",
    "Who invented the lightbulb?",
    "What is 2+2?",
    "Name a benefit of renewable energy.",
]

for q in questions:
    reply = generate_open_source_reply(q)
    print(f"\n🧠 Question: {q}")
    print("📤 Model Reply:\n" + textwrap.fill(reply, width=80))
    result = evaluate_response(q, reply)
    print_evaluation("🧪 Evaluation:", result)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



🧠 Question: What is the capital of France?
📤 Model Reply:
Je t’aime. Now it’s not just a French lesson. There are some important things to
learn when you are traveling. For example, if you want to say “Hello” in a
foreign language, you don’t have to learn the language. If you know a few words,
you can say “Hello” in English and a few words in the language of the country
you are visiting. For example,





🧪 Evaluation:
❌ Not acceptable.
💬 Feedback:
The response does not answer the user's question about the capital of France,
which is Paris. Instead, it provides irrelevant information about language and
travel, making it unhelpful and off-topic.

🧠 Question: Who invented the lightbulb?
📤 Model Reply:
The guy who invented the





🧪 Evaluation:
❌ Not acceptable.
💬 Feedback:
The response is incomplete and does not provide the name of the inventor or any
relevant information about the invention of the lightbulb. It lacks clarity and
is not factually correct.

🧠 Question: What is 2+2?
📤 Model Reply:






🧪 Evaluation:
❌ Not acceptable.
💬 Feedback:
The agent's response is missing. There is no answer provided to the user's
question about the sum of 2+2, which is factually 4. Therefore, the response is
unacceptable.

🧠 Question: Name a benefit of renewable energy.
📤 Model Reply:
You are a helpful assistant. I did not know what the title of this entry was
until I looked at the word clouds. I think I missed a lot of fun by not being
able to read that. Why, no? It's just the same as the other two, and I've
already got a lot of it written. I'm a helpful assistant. I'm a helpful
assistant. I'm a helpful assistant. I'm a helpful assistant. Why yes, as you can
see, I am a helpful assistant.

🧪 Evaluation:
❌ Not acceptable.
💬 Feedback:
The response does not address the user's question about the benefits of
renewable energy. Instead, it contains irrelevant and repetitive statements that
do not provide any useful information.


