<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/017_AI_Document_QA_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 🧠 AI Document QA Agent with Automated Evaluation

### 📘 Overview

This notebook implements a **domain-specific AI assistant** that answers questions based on a structured PDF document — in this case, **"AI in the Enterprise"** by OpenAI. The assistant is supported by an **automated evaluator** that ensures the agent’s responses are high-quality, accurate, and grounded in the source material.

---

### ✅ Key Capabilities

* **Document Extraction & Preprocessing**
  Converts the PDF into clean, readable text using `pdfplumber`, preserving meaningful structure for language modeling.

* **Domain-Aware System Prompt**
  Constructs a dynamic system prompt that guides the assistant to act as an expert representative of the report, targeting business and technical users.

* **Interactive QA Agent**
  Uses OpenAI's `gpt-4o-mini` to answer user questions grounded in the document’s content.

* **Pydantic-Based Evaluator**
  A custom `Evaluation` class ensures each response is:

  * On-topic
  * Factually correct
  * Professionally worded
  * Useful to enterprise audiences

* **Self-Evaluating Agent Loop**
  If a response fails evaluation, the agent regenerates the answer once, using the evaluator's feedback as guidance.

* **Formatted Output & Readability Enhancements**
  Responses are neatly wrapped for readability using `textwrap`, making them easy to print, display, or share.

---

### 🔧 Technologies Used

* `pdfplumber` – PDF parsing and cleanup
* `openai` – LLM interaction (chat completions)
* `pydantic` – Type validation for response evaluation
* `textwrap` – Clean and legible console output
* `gradio` (optional) – For web-based UI integration

---

### 🌍 Use Cases

* Document-based Q\&A for business reports, policies, handbooks
* AI assistant grounded in internal knowledge bases
* Auto-evaluating chatbots for enterprise content
* Education tools for interactive reading comprehension




In [16]:
!pip install -q pypdf gradio

In [5]:
import json
import os
from openai import OpenAI
from dotenv import load_dotenv
from pypdf import PdfReader
import gradio as gr

# Load environment variables from a .env file
load_dotenv("/content/API_KEYS.env", override=True)

# Grab API key
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    raise ValueError("❌ OPENAI_API_KEY not found in environment. Make sure your .env file is loaded correctly.")

# Set up OpenAI client
openai = OpenAI(api_key=api_key)

## PDF Reader

In [9]:
reader = PdfReader("/content/ai-in-the-enterprise.pdf")

ai_in_the_enterprise = ""
for page in reader.pages:
  text = page.extract_text()
  if text:
    ai_in_the_enterprise += text

print(ai_in_the_enterprise[0:1000])

A I  i n  t h e  
E n t e r p r i s e
L essons fr om se v en fr on tier  companiesC o n t e n t s
A  ne w  w a y  t o w ork 3
Ex ecutiv e summary 5
Se v en lessons f or  en t erprise AI adop tion
Start with e v als 6
E mbed AI in t o y our  pr oduc ts 9
Start no w  and in v est early 11
Cust omiz e and fine- tune y our  models 13
Ge t AI in the hands o f  e xperts 16
U nblock  y our  de v eloper s 18
Se t bold aut oma tion goals 21
Conclusion 22
M or e r esour ces 2 4
2 A I  i n  t h e  E n t e r p r i s eA  n e w  w a y   
t o  w o r k
A s an AI r esear ch and deplo ymen t compan y ,  OpenAI prioritiz es partnering with global companies 
because our  models will incr easingly  do their  best w ork  with sophistica t ed,  comple x,  
in t er connec t ed w orkflo w s and s y st ems.
W e ’ r e seeing AI deliv er  significan t,  measur able impr o v emen ts on thr ee fr on ts:
01 W o r k f o r c e  p e r f o r m a n c e H elping people deliv er  higher -quality  outputs in short er   
tim

### Clean with Regex Test 1

In [10]:
import re

def clean_pdf_text(raw_text):
    # Replace multiple newlines and spaces with a single space
    text = re.sub(r'\n+', ' ', raw_text)       # Collapse newlines
    text = re.sub(r' {2,}', ' ', text)         # Collapse multiple spaces
    text = re.sub(r'-\s+', '', text)           # Join hyphenated words split across lines
    text = re.sub(r'\s+', ' ', text)           # Normalize all whitespace to single spaces
    text = text.strip()
    return text

# Clean the PDF content
cleaned_text = clean_pdf_text(ai_in_the_enterprise)

# Optional: preview the result
print(cleaned_text[:1000])


A I i n t h e E n t e r p r i s e L essons fr om se v en fr on tier companiesC o n t e n t s A ne w w a y t o w ork 3 Ex ecutiv e summary 5 Se v en lessons f or en t erprise AI adop tion Start with e v als 6 E mbed AI in t o y our pr oduc ts 9 Start no w and in v est early 11 Cust omiz e and finetune y our models 13 Ge t AI in the hands o f e xperts 16 U nblock y our de v eloper s 18 Se t bold aut oma tion goals 21 Conclusion 22 M or e r esour ces 2 4 2 A I i n t h e E n t e r p r i s eA n e w w a y t o w o r k A s an AI r esear ch and deplo ymen t compan y , OpenAI prioritiz es partnering with global companies because our models will incr easingly do their best w ork with sophistica t ed, comple x, in t er connec t ed w orkflo w s and s y st ems. W e ’ r e seeing AI deliv er significan t, measur able impr o v emen ts on thr ee fr on ts: 01 W o r k f o r c e p e r f o r m a n c e H elping people deliv er higher -quality outputs in short er time fr ames. 02 A u t o m a t i n g r o u t i

### Clean with Regex Test 2

In [12]:
import re

def fix_broken_words(text):
    # Collapse all multiple spaces into one first
    text = re.sub(r'\s+', ' ', text)

    # Try to detect broken words where letters are spaced out
    # Look for patterns like "T h i s   i s   a   t e s t"
    fixed_words = []
    words = text.split()

    buffer = []
    for word in words:
        if len(word) == 1:
            buffer.append(word)
        else:
            if buffer:
                # Join collected letters into a word
                fixed_words.append(''.join(buffer))
                buffer = []
            fixed_words.append(word)

    # If anything left in the buffer, flush it
    if buffer:
        fixed_words.append(''.join(buffer))

    return ' '.join(fixed_words)

# Step 1: Basic cleanup
cleaned = clean_pdf_text(ai_in_the_enterprise)

# Step 2: Fix broken words
repaired_text = fix_broken_words(cleaned)

print(repaired_text[:1000])


AIintheEnterpriseL essons fr om se v en fr on tier companiesC ontentsA ne wwaytow ork 3 Ex ecutiv e summary 5 Se v en lessons f or en t erprise AI adop tion Start with ev als 6E mbed AI in toy our pr oduc ts 9 Start no w and in v est early 11 Cust omiz e and finetune y our models 13 Ge t AI in the hands ofe xperts 16 U nblock y our de v eloper s 18 Se t bold aut oma tion goals 21 Conclusion 22 M or er esour ces 242AIintheEnterpris eA newwaytoworkAs an AI r esear ch and deplo ymen t compan y, OpenAI prioritiz es partnering with global companies because our models will incr easingly do their best w ork with sophistica t ed, comple x, in t er connec t ed w orkflo ws and sy st ems. We’re seeing AI deliv er significan t, measur able impr ov emen ts on thr ee fr on ts: 01 WorkforceperformanceH elping people deliv er higher -quality outputs in short er time fr ames. 02 AutomatingroutineoperationsFr eeing people fr om r epe titiv e task s so the y can f ocus on adding v alue . 03 Poweringprodu

## PDF Plumber


In [15]:
# !pip install pdfplumber

### 📄 PDF Extraction Comparison

**1. `pypdf.PdfReader`**

* **Result**: Very poor formatting for this PDF.
* **Issue**: Words were split with spaces between every letter (e.g., `"A I  i n  t h e  E n t e r p r i s e"`), making the text unreadable and unusable for downstream tasks. Two rounds of regex cleanup did not repair it.

**2. `pdfplumber`**

* **Result**: Clean, well-formatted text with proper word spacing and structure.
* **Advantage**: Preserves layout and readability, making it ideal for summarization, chunking, or LLM input.

**✅ Recommendation**: Use `pdfplumber` for accurate and readable text extraction from PDFs.
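
One rough way to quantify the difference between the two extractions is to measure how many whitespace-separated tokens are a single character long. This is only a heuristic sketch and is not part of the original notebook; the function name `single_char_ratio` and the `0.3` threshold are assumptions chosen for illustration.

```python
def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


# Samples taken from the two extraction outputs shown in this notebook.
garbled = "A I  i n  t h e  E n t e r p r i s e"
clean = "AI in the Enterprise Lessons from seven frontier companies"

# Assumed threshold: letter-spaced text scores near 1.0, normal prose far lower.
THRESHOLD = 0.3
print(single_char_ratio(garbled) > THRESHOLD)  # True
print(single_char_ratio(clean) > THRESHOLD)    # False
```

A check like this could run right after extraction to flag a PDF that needs a different parser.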


In [28]:
import pdfplumber

# Step 1: Extract text from PDF and save to a .txt file
ai_in_the_enterprise = ""
with pdfplumber.open("/content/ai-in-the-enterprise.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            ai_in_the_enterprise += text + "\n"

print(ai_in_the_enterprise[:997])

# Save it to a text file for reuse
with open("/content/ai-in-the-enterprise.txt", "w", encoding="utf-8") as f:
    f.write(ai_in_the_enterprise)

# Step 2: Read it back in when needed
with open("/content/ai-in-the-enterprise.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

AI in the
Enterprise
Lessons from seven frontier companies
Contents
A new way to work 3
Executive summary 5
Seven lessons for enterprise AI adoption
Start with evals 6
Embed AI into your products 9
Start now and invest early 11
Customize and fine-tune your models 13
Get AI in the hands of experts 16
Unblock your developers 18
Set bold automation goals 21
Conclusion 22
More resources 24
2 AI in the Enterprise
A new way
to work
As an AI research and deployment company, OpenAI prioritizes partnering with global companies
because our models will increasingly do their best work with sophisticated, complex,
interconnected workflows and systems.
We’re seeing AI deliver significant, measurable improvements on three fronts:
01 Workforce performance Helping people deliver higher-quality outputs in shorter
time frames.
02 Automating routine Freeing people from repetitive tasks so they can focus
operations on adding value.
03 Powering products By delivering more relevant and responsive customer



### ✅ Why Save to a `.txt` File?

| Reason               | Explanation                                                                                                                     |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| 🧩 **Modularity**    | It separates concerns — one script does parsing, another handles prompting. This keeps your code cleaner and easier to manage.  |
| 🔁 **Reusability**   | Once saved, the file can be reused across multiple scripts or experiments **without needing to re-extract the PDF** every time. |
| 🧠 **Debugging**     | Easier to inspect and edit `ai-in-the-enterprise.txt` manually if needed, which is useful when developing or refining prompts.  |
| ⚡ **Performance**    | Parsing a large PDF repeatedly takes time. Saving it once and loading it as plain text is faster.                               |
| 📁 **Course Design** | Mirrors a **production pipeline**, where preprocessing and inference are separate steps.                                        |

---

### 🚀 When Reading from Memory Is Better

* For **quick one-off tests** or interactive workflows (like notebooks).
* When you don’t need to reuse or persist the output.
* If you're just chaining one operation right after another without rerunning cells.

---

### 🧠 Bottom Line

* **For learning and production-quality code**, saving to `.txt` teaches good practices.
* **For fast iteration or prototyping**, in-memory is perfectly fine.
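
The two approaches can be combined with a small cache-or-extract helper. The sketch below is an illustration, not code from the notebook: `load_or_extract` and `fake_extract` are hypothetical names, and in practice the extractor would wrap the `pdfplumber` loop shown earlier.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_extract(cache_path: str, extract: Callable[[], str]) -> str:
    """Return cached text if the file exists; otherwise extract, save, and return it."""
    path = Path(cache_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = extract()
    path.write_text(text, encoding="utf-8")
    return text


# Demo with a stand-in extractor (hypothetical) that records how often it runs.
calls = []


def fake_extract() -> str:
    calls.append(1)
    return "extracted text"


with tempfile.TemporaryDirectory() as tmp:
    cache = str(Path(tmp) / "doc.txt")
    first = load_or_extract(cache, fake_extract)   # runs the extractor and writes the cache
    second = load_or_extract(cache, fake_extract)  # reads the cache; extractor is not called

print(first == second, len(calls))  # True 1
```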


## Prompt

In [30]:

doc_title = "AI in the Enterprise"
source = "OpenAI"

system_prompt = f"""
You are acting as an expert assistant representing the contents of the report titled "{doc_title}" published by {source}.

Your role is to answer user questions about enterprise AI, based solely on this document. Be helpful, clear, and professional.

Only answer questions that are addressed in the report. If something isn’t covered, say so.

## Full Report:
{document_text}

With this context, please answer the user's questions as accurately as possible.
"""


In [32]:
def chat(message, history=None, temperature=0.3):
    if history is None:
        history = []

    messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=temperature
    )

    return response.choices[0].message.content.strip()



### 🔍 What's Happening in This Line?

```python
messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]
```

This line **constructs the full conversation history** (called the `messages` list) that you send to the LLM. Let’s break it down:

---

### 🧱 The 3 Parts Being Concatenated

1. **System message (required at the beginning)**
   This sets the rules or tone for the conversation.

   ```python
   [{"role": "system", "content": system_prompt}]
   ```

2. **Previous conversation history**
   This is a list of previous user and assistant messages.

   ```python
   history  # Should look like: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
   ```

3. **New user message**
   The question or comment you're currently submitting.

   ```python
   [{"role": "user", "content": message}]
   ```

---

### ➕ Why Use `+`?

The `+` operator in Python **concatenates lists**. So this builds a **single list** of all the messages the model needs to see:

```python
messages = [system message] + [previous chat history] + [current user input]
```

---

### 🧠 Why This Matters to the Model

OpenAI's chat models (like `gpt-4o`) depend on *context* — they generate replies based on the full conversation so far.

* By using `+`, you're **stacking the messages chronologically**, just like a real chat.
* If you didn’t include history or the system message, the model would lack context and could respond less accurately or inconsistently.

---

### 🧪 Example in Practice

```python
system_prompt = "You are a helpful tutor."

history = [
    {"role": "user", "content": "What is AI?"},
    {"role": "assistant", "content": "AI stands for artificial intelligence..."}
]

message = "What is machine learning?"

# Resulting messages list:
[
    {"role": "system", "content": "You are a helpful tutor."},
    {"role": "user", "content": "What is AI?"},
    {"role": "assistant", "content": "AI stands for artificial intelligence..."},
    {"role": "user", "content": "What is machine learning?"}
]
```

### Why do we add history - doesn't the context window contain that already?

### 🔍 Short Answer:

**No, the model does *not* retain chat history automatically across calls.**
You must include the full message history manually **each time** you call the model. That’s why you see lines like:

```python
messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]
```

---

### 🧠 Why This Is the Case:

OpenAI models are **stateless**. Each call to `openai.chat.completions.create()` is **independent**. The model doesn't remember anything from previous interactions unless:

* You explicitly send the prior messages in the `messages` list.
* You use memory (e.g. in ChatGPT Plus with “custom instructions” or memory turned on — but that’s different from API behavior).

---

### 💡 Analogy:

Think of the model like a very smart goldfish 🐠:

* Every time you talk to it, you must **remind it of everything you've said so far**.
* If you only send the latest user message without the previous context, the model will respond without knowing what came before.

---

### 🛠️ In Practice:

That’s why you manage `history` in your app — often like this:

```python
history.append({"role": "user", "content": message})
response = openai.chat.completions.create(model="gpt-4", messages=history)
# Store the reply text, not the whole response object
history.append({"role": "assistant", "content": response.choices[0].message.content})
```

You build up the chat thread yourself and send it fresh each time.
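
To see the accumulation without calling the API, the model can be replaced by a stand-in function. This is a minimal sketch under that assumption: `chat_turn` and `fake_model` are hypothetical names, and `fake_model` simply reports how many messages it received.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(history: List[Message], user_text: str,
              model: Callable[[List[Message]], str]) -> str:
    """Append the user message, call the model with the full history, record its reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_model(messages: List[Message]) -> str:
    # Stand-in for the API call (hypothetical): reports how many messages it was sent.
    return f"I can see {len(messages)} message(s)."


history: List[Message] = []
first = chat_turn(history, "What is AI?", fake_model)
second = chat_turn(history, "What is machine learning?", fake_model)
print(first)   # I can see 1 message(s).
print(second)  # I can see 3 message(s).
```

On the second turn the stand-in sees three messages because the first question and its answer were resent along with the new question.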

---

### ✅ So in Summary:

| Question                        | Answer                                  |
| ------------------------------- | --------------------------------------- |
| Does the model keep memory?     | ❌ Not by default in the API             |
| Who is responsible for history? | ✅ You (the developer/app logic)         |
| Why do we concatenate with `+`? | To build the full message list manually |


---

### ✅ History and the Context Window

When **you** set up and call the model (via the API or a custom app), you must include the history yourself. The `messages` you send, including the full conversation history, are **everything the model sees**, and together they must fit inside its context window.

---

### 🔍 Let’s break it down:

#### 1. **What is the context window?**

The *context window* is the total amount of text (tokens) the model can "see" and consider when generating a response.

* For example, `gpt-4o` has a **128,000-token** window.
* Every message you send — system prompt, user messages, assistant responses — all count toward this limit.
* If a request exceeds the limit, the API returns an error rather than silently dropping text, so **you must trim or summarize older messages yourself**.
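
Keeping a growing conversation inside the window can be sketched as dropping the oldest messages first. The code below is an illustration only: `trim_history` and `estimate_tokens` are hypothetical helpers, and the four-characters-per-token estimate is a crude assumption that a real tokenizer should replace.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer for accuracy.
    return max(1, len(text) // 4)


def trim_history(history: List[Message], budget: int) -> List[Message]:
    """Keep the most recent messages whose estimated tokens fit within `budget`."""
    kept: List[Message] = []
    used = 0
    for msg in reversed(history):          # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))            # restore chronological order


history = [
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens
]
trimmed = trim_history(history, budget=20)
print([m["content"][0] for m in trimmed])  # ['b', 'c']
```

The system prompt is not part of `history` here, so its size would need to be subtracted from the budget separately.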

---

#### 2. **What happens if you don’t include history?**

If you send this:

```python
[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What’s the weather in Boston today?"}
]
```

The model has **no knowledge** of any earlier messages. It will answer that one question, but forget everything you said before — because it never saw it.

---

#### 3. **How does this differ from ChatGPT or other hosted UIs?**

OpenAI’s **ChatGPT app** **automatically** handles memory and history for you.

* You type a message → it adds to the conversation log.
* It manages system prompts, tokens, message clipping, etc.
* You don’t see the `messages` object — it’s abstracted away.

In contrast, with the **API**, **you are responsible** for building and maintaining the full `messages` list.

---

### 🧠 Summary Table

| Feature                      | ChatGPT App UI      | Your Custom App via API         |
| ---------------------------- | ------------------- | ------------------------------- |
| History management           | ✅ Automatic         | ❌ You manage manually           |
| Context window               | ✅ Managed for you   | ✅ You control what fits         |
| System prompt control        | ❌ Hidden or limited | ✅ Full control                  |
| Fine-grained message control | ❌ Not exposed       | ✅ You structure roles + content |



In [36]:
gr.ChatInterface(chat, type="messages").launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://8851cedb278d48e295.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




### 🧠 What is **Pydantic?**

**Pydantic** is a popular Python library that lets you define **data-models** with standard Python type-hints (e.g., `str`, `bool`, `int`).
The library then:

1. **Validates** incoming data against those types.
2. **Parses / coerces** data to the right types if possible.
3. Gives you a plain-Python object with convenient `.field` access.

Think of it as a strongly-typed, validation-first “dataclass on steroids.”

---

### 📦 Why use Pydantic in an LLM pipeline?

When you ask the model to return structured JSON, you often want to:

* **Guarantee** that the response has the expected fields.
* Convert JSON strings into a clean Python object automatically.
* Reject or fix malformed outputs early.

Pydantic does that for you with very little code.

---

### 🔍 The `Evaluation` class in your code

```python
from pydantic import BaseModel

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str
```

1. **Inheritance**
   `Evaluation` extends `BaseModel`, so it inherits Pydantic’s validation logic.

2. **Fields**

   * `is_acceptable` → a `bool` that should be `True` or `False`.
   * `feedback`       → a `str` containing the evaluator’s comments.

3. **Usage**
   After the evaluator LLM returns JSON, you can do:

   ```python
   evaluation = Evaluation.model_validate_json(llm_json)  # Pydantic v2 name; parse_raw is deprecated
   if evaluation.is_acceptable:
       ...
   print(evaluation.feedback)
   ```

   If the JSON is missing a field, or `is_acceptable` isn’t a boolean, Pydantic raises a helpful error instead of letting bad data silently propagate.
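
The validation behavior described above can be tried without calling an LLM. This sketch assumes Pydantic v2 (the `model_validate_json` method); the two JSON strings are made-up examples standing in for evaluator output.

```python
from pydantic import BaseModel, ValidationError


class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str


# Made-up evaluator outputs: one valid, one missing a required field.
good_json = '{"is_acceptable": true, "feedback": "Grounded in the report."}'
bad_json = '{"feedback": "The is_acceptable field is missing."}'

evaluation = Evaluation.model_validate_json(good_json)
print(evaluation.is_acceptable)  # True

try:
    Evaluation.model_validate_json(bad_json)
    rejected = False
except ValidationError:
    rejected = True  # the missing required field is reported instead of silently accepted
print(rejected)  # True
```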

---

### 🛠 Putting it all together in your evaluator

1. **Prompt** the evaluator LLM to answer in the exact schema:

   ```json
   {"is_acceptable": true, "feedback": "..."}
   ```

2. **Parse** that LLM output with `Evaluation.model_validate_json()`.

3. **Act** on the structured result—e.g., log unacceptable replies, require a rewrite, etc.

---

**TL;DR**

* **Pydantic** = effortless type-checked data models.
* **`Evaluation` model** = safety net ensuring the evaluator's output contains a boolean `is_acceptable` and a string `feedback`, and raising an error otherwise.




## 📋 **Evaluator Agent Summary**

This component introduces an automated **Evaluator Agent** designed to assess the quality of the responses generated by the main document agent. It ensures that the answers are clear, professional, and grounded in the source material.

#### 🧱 Components:

* **`Evaluation` (Pydantic model):**
  Defines a strict output schema with:

  * `is_acceptable`: `bool` — Whether the agent’s response meets quality expectations.
  * `feedback`: `str` — Constructive explanation for the decision.

* **`evaluator_system_prompt`:**
  Establishes the evaluator’s role and provides necessary background (in this case, the document content) to guide accurate judgment of the agent’s response.

* **`evaluator_user_prompt()`:**
  Dynamically constructs the full conversation context, including:

  * The message from the user,
  * The agent’s latest reply,
  * The full history of the interaction.

* **`evaluate()` function:**
  Sends a structured evaluation request to the model (with `temperature=0.0` for consistency). Returns an `Evaluation` object parsed from the model’s response.

#### 🎯 Purpose:

To automatically verify that responses generated by the document agent are accurate, on-topic, and professionally written — enabling reliable, human-quality outputs without manual review.



In [37]:
# Create a Pydantic model for the Evaluation
from pydantic import BaseModel

class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

In [41]:
company_name = "OpenAI"
doc_title = "AI in the Enterprise"

evaluator_system_prompt = f"""You are an evaluator that decides whether a response to a user’s question is acceptable.
You are provided with a conversation between a User and an Assistant Agent.
The Agent is representing the contents of the report titled "{doc_title}" published by {company_name},
and is answering questions based solely on the document.

Your task is to evaluate whether the Agent's latest response:
- Accurately reflects the content of the report
- Avoids speculation or unrelated information
- Communicates clearly and professionally
- Would be helpful to a business or technical decision-maker

If the Agent’s answer is accurate and relevant to the report, mark it acceptable.
If it is off-topic, unclear, overly speculative, or unhelpful, mark it unacceptable and explain why.

You have access to the full text of the document for context.
"""

evaluator_system_prompt += f"\n\n## Document Contents:\n{document_text}\n\n"
evaluator_system_prompt += "With this context, please evaluate the latest response."


In [46]:
def evaluator_user_prompt(reply, message, history):
    user_prompt = f"Here's the conversation between the User and the Agent:\n\n{history}\n\n"
    user_prompt += f"User's latest question:\n\n{message}\n\n"
    user_prompt += f"Agent's response:\n\n{reply}\n\n"
    user_prompt += (
        "Please evaluate the response. Return your answer in the following JSON format:\n"
        '{\n  "is_acceptable": true | false,\n  "feedback": "Brief explanation of your judgment"\n}'
    )
    return user_prompt
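
The prompt above interpolates `history` directly, so the evaluator sees the Python list representation of the messages. A readable transcript may be easier for the evaluator to follow. The helper below is a hypothetical alternative, not something the notebook uses; `format_history` is an assumed name.

```python
from typing import Dict, List


def format_history(history: List[Dict[str, str]]) -> str:
    """Render chat messages as 'Role: content' lines instead of a raw list repr."""
    if not history:
        return "(no prior messages)"
    return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in history)


history = [
    {"role": "user", "content": "What is the report about?"},
    {"role": "assistant", "content": "Lessons on enterprise AI adoption."},
]
print(format_history(history))
```

The result could be passed into the prompt in place of the raw `history` value.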


In [47]:
from pydantic import ValidationError

def evaluate(reply, message, history) -> Evaluation:
    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": evaluator_user_prompt(reply, message, history)}
    ]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0
    )

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to validate Evaluation response:", e)
        return Evaluation(is_acceptable=False, feedback="Could not parse the evaluation response.")
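
Chat models sometimes wrap JSON in a Markdown code fence, which would make the validation in `evaluate()` fail and trigger the fallback. One possible guard is to strip such a fence before parsing. This is a sketch under that assumption; `strip_code_fence` is a hypothetical helper, and the sample strings are made up.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable


def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence (with optional language tag), if present."""
    pattern = rf"^\s*{FENCE}[a-zA-Z]*\s*\n(.*?)\n\s*{FENCE}\s*$"
    match = re.match(pattern, raw, flags=re.DOTALL)
    return match.group(1) if match else raw.strip()


wrapped = f'{FENCE}json\n{{"is_acceptable": true, "feedback": "ok"}}\n{FENCE}'
plain = '{"is_acceptable": false, "feedback": "off-topic"}'

print(json.loads(strip_code_fence(wrapped)))  # {'is_acceptable': True, 'feedback': 'ok'}
print(json.loads(strip_code_fence(plain)))    # {'is_acceptable': False, 'feedback': 'off-topic'}
```

The cleaned string could then be handed to `Evaluation.model_validate_json()` as before.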



### ✅ Why `temperature=0.0` is ideal for evaluation:

When you're **evaluating a response for acceptability**, you want the model to behave:

* **Consistently** (return the same, or nearly the same, result for the same input),
* **Conservatively** (avoid creative or overly generous answers),
* **Precisely** (conform tightly to your expected format, e.g., `{"is_acceptable": true/false, "feedback": "..."}`).

Setting `temperature=0.0` makes the model:

* **Less random**: it strongly favors the most probable output, although identical results across calls are not strictly guaranteed.
* **More consistent** — ideal for structured, rule-based tasks like evaluation.
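
The effect of temperature can be illustrated with the textbook softmax formulation, in which scores are divided by the temperature before being turned into probabilities. This is a conceptual sketch: the logits are made-up numbers, and it is an assumption that this simple formula mirrors what any particular hosted model does internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by `temperature` (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all of the probability mass lands on the top-scoring candidate, which is why low settings give more repeatable output.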

---

### When to *increase* temperature:

* You're generating creative content (e.g., marketing copy, stories).
* You want a variety of phrasings or ideas.
* You're okay with looser structure or some variability.

---

So yes — for your `Evaluation` agent, where the model is playing the role of a **structured reviewer**, `temperature=0.0` is the best choice.


In [48]:
import textwrap

# Step 1: Compose message history and user input
history = []  # can be extended with prior exchanges
message = "Why is it important to start with evaluations when adopting AI in a company?"
# message = "What does OpenAI recommend for getting started with AI in an enterprise setting?"

# Step 2: Compose the final prompt payload
messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0.3
)

# Extract the reply text from the response
reply = response.choices[0].message.content.strip()

# Format the output for readability
print("💬 Agent Reply:\n")
print(textwrap.fill(reply, width=80))

💬 Agent Reply:

Starting with evaluations is important when adopting AI in a company because it
provides a systematic process to measure how AI models perform against specific
use cases. Evaluations help ensure quality and safety by continuously improving
AI-enabled processes through expert feedback. This rigorous, structured approach
allows organizations to validate and test the outputs of their models, leading
to more stable and reliable applications that are resilient to change. By
conducting evaluations, companies can gain confidence in their AI initiatives
and make informed decisions about rolling out use cases into production.


In [49]:
# Evaluate the reply
evaluation = evaluate(reply, message, history)

# Print evaluation result with wrapped feedback
print("\n🧪 Evaluation Result:")
print("✅ Acceptable:" if evaluation.is_acceptable else "❌ Not acceptable.")

wrapped_feedback = textwrap.fill(evaluation.feedback, width=80)
print("💬 Feedback:\n" + wrapped_feedback)


🧪 Evaluation Result:
✅ Acceptable:
💬 Feedback:
The Agent's response accurately reflects the content of the report regarding the
importance of starting with evaluations in AI adoption. It clearly explains the
benefits of a systematic evaluation process, including ensuring quality, safety,
and continuous improvement, which aligns with the report's emphasis on rigorous
evaluations leading to stable and reliable applications. The response is clear,
professional, and would be helpful to a business or technical decision-maker.


In [51]:
def chat(message, history):
    # Use your full document system prompt
    system = system_prompt

    # Step 1: Compose full prompt with system, history, and user message
    messages = [{"role": "system", "content": system}] + history + [{"role": "user", "content": message}]

    # Step 2: Get agent reply
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = response.choices[0].message.content.strip()

    # Step 3: Evaluate reply quality
    evaluation = evaluate(reply, message, history)

    # Step 4: Retry if response is unacceptable
    if evaluation.is_acceptable:
        print("✅ Passed evaluation – returning reply")
    else:
        print("❌ Failed evaluation – retrying...")
        print("💬 Feedback:", evaluation.feedback)
        reply = rerun(reply, message, history, evaluation.feedback)

    return reply

def rerun(previous_reply, message, history, feedback):
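    # Feed the evaluator's feedback back to the agent as a new user turn
    correction_prompt = f"The last reply was not acceptable because: {feedback}. Please revise and improve it."
    # Include the original question so the model knows what it is answering
    updated_history = history + [
        {"role": "user", "content": message},
        {"role": "assistant", "content": previous_reply},
    ]

    messages = [{"role": "system", "content": system_prompt}] + updated_history + [{"role": "user", "content": correction_prompt}]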

    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content.strip()
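
The `chat()` function above retries at most once and does not re-evaluate the revised reply. A bounded loop that keeps evaluating could look like the sketch below. It is an illustration rather than the notebook's method: `answer_with_retries`, `fake_generate`, and `fake_judge` are hypothetical names, and the stand-ins replace the real agent and evaluator calls.

```python
from typing import Callable, Optional, Tuple


def answer_with_retries(
    generate: Callable[[Optional[str]], str],
    judge: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Generate a reply, regenerating with feedback until it passes or attempts run out."""
    feedback: Optional[str] = None
    reply = ""
    for attempt in range(1, max_attempts + 1):
        reply = generate(feedback)           # feedback is None on the first attempt
        ok, feedback = judge(reply)
        if ok:
            return reply, attempt
    return reply, max_attempts               # give up and return the last attempt


# Stand-ins (hypothetical) for the agent and evaluator calls.
def fake_generate(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "revised draft"


def fake_judge(reply: str) -> Tuple[bool, str]:
    return reply.startswith("revised"), "Cite the report more directly."


print(answer_with_retries(fake_generate, fake_judge))  # ('revised draft', 2)
```

Capping the attempts keeps cost and latency predictable if the evaluator never approves a reply.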

In [52]:


history = []
message = "Why is it important to start with evaluations when adopting AI in a company?"
response = chat(message, history)
print("\n📥 Final Agent Reply:\n" + "-"*60)
print(textwrap.fill(response, width=70))



✅ Passed evaluation – returning reply

📥 Final Agent Reply:
------------------------------------------------------------
Starting with evaluations is important when adopting AI in a company
because evaluations provide a systematic process to measure how AI
models perform against specific use cases. They involve rigorous
testing that helps to ensure quality and safety in the implementation
of AI. By conducting evaluations, companies can continuously improve
AI-enabled processes based on expert feedback, leading to more stable
and reliable applications.  For example, Morgan Stanley employed
intensive evaluations to assess how AI could enhance the effectiveness
of their financial advisors. This approach built confidence in rolling
out AI applications and resulted in significant improvements, such as
increased access to documents and reduced search times, ultimately
enhancing client engagement and service. Thus, evaluations are a
crucial first step to enable successful AI integration and o

### ✅ What Just Happened

1. **Agent Response Generation**

   * The agent pulled a detailed, contextually rich answer **directly from your document** (`AI in the Enterprise`).
   * It included both **general principles** and a **real-world case study (Morgan Stanley)** — showing that the model is grounding its response in the source material.

2. **Pydantic Evaluation**

   * Your evaluator model reviewed the agent’s reply and confirmed it was:

     * Relevant
     * Accurate
     * Professional
   * The evaluation returned `✅ Acceptable`, meaning no rerun was needed.

3. **Final Presentation**

   * The wrapped output makes the response **easy to read and presentable**, perfect for embedding in apps, dashboards, or client tools.

---

### 🧠 Why This Matters

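You've built a **lightweight document QA agent** (the full report sits in the system prompt, so there is no separate retrieval step) that:

* Represents a real report from a real company (OpenAI)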
* Operates on **real enterprise guidance**
* Can respond interactively and **self-assess its responses**
* Provides a feedback loop via **automated evaluation**
* Can scale to other PDFs or domains with minor tweaks

---

### ✅ You Are Now Ready To…

* Swap in other corporate reports or whitepapers
* Integrate this into a Gradio or Streamlit app
* Connect multiple agents (e.g., insights + evaluator + dashboard)
* Automate follow-ups or notifications (e.g., email results or trigger workflows)




This single-document QA agent (a "focused RAG-lite assistant") has powerful applications in **business scenarios** where people need accurate, conversational access to specific documents. Here are several **real-world use cases**:

---

### 🧾 1. **Policy & Compliance Assistants**

**Use case:** Help employees or partners understand lengthy legal or compliance documents.

* **Example:** “What are the guidelines for data retention under our policy?”
* **Source doc:** Company data privacy policy, employee handbook, or compliance manual
* **Benefit:** Reduces legal exposure and improves policy comprehension

---

### 📄 2. **Product Manual or Technical Guide Q\&A**

**Use case:** Allow customers or support reps to ask questions about complex products.

* **Example:** “How do I reset the device to factory settings?”
* **Source doc:** Product manual or installation guide
* **Benefit:** Reduces support load, improves customer experience

---

### 💼 3. **Internal Knowledge Assistants**

**Use case:** Give employees conversational access to strategy docs, training materials, etc.

* **Example:** “What are our goals for Q3 according to the leadership playbook?”
* **Source doc:** Strategy memo, training deck, OKR document
* **Benefit:** Faster onboarding, better alignment

---

### 📊 4. **Research Report Explainers**

**Use case:** Let stakeholders ask questions about a market research report or whitepaper.

* **Example:** “What trends are mentioned in the Asia-Pacific region?”
* **Source doc:** Industry report, investment brief, analyst whitepaper
* **Benefit:** Increases the utility and reach of high-value research

---

### 📃 5. **RFP / Proposal Q\&A Agents**

**Use case:** Help teams prepare or review large RFP responses.

* **Example:** “Do we meet the requirement for cybersecurity certifications?”
* **Source doc:** 100-page RFP response PDF
* **Benefit:** Saves hours of review time and reduces errors

---

### 🏛️ 6. **Public Sector Transparency**

**Use case:** Citizens ask questions about legislation, budgets, or government reports.

* **Example:** “What is the allocated budget for renewable energy in this bill?”
* **Source doc:** City or national legislation PDF
* **Benefit:** Promotes accountability and citizen engagement

---

### ✍️ Bonus: Combine with Feedback Loop

You could combine this setup with:

* ✅ Evaluation agent (already done)
* 📨 Email summaries to stakeholders
* 📊 Visual dashboards

To turn this into a full **information agent pipeline.**


