<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/018_AI_Document_QA_Agent_Clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## 🧠 AI Document QA Agent with Automated Evaluation





In [16]:
!pip install -q gradio pdfplumber openai python-dotenv pydantic

In [53]:
import json
import os
from openai import OpenAI
from dotenv import load_dotenv
import pdfplumber
import gradio as gr
from pydantic import BaseModel, ValidationError
import textwrap
import re

# Load environment variables from a .env file
load_dotenv("/content/API_KEYS.env", override=True)

# Grab API key
api_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    raise ValueError("❌ OPENAI_API_KEY not found in environment. Make sure your .env file is loaded correctly.")

# Set up OpenAI client
openai = OpenAI(api_key=api_key)

## PDF Plumber


In [54]:
# Extract text from PDF
document_text = ""
with pdfplumber.open("/content/ai-in-the-enterprise.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        if text:
            document_text += text + "\n"

# Remove extra newlines and normalize spacing
document_text = re.sub(r"\n{2,}", "\n", document_text)
document_text = re.sub(r"[ \t]+", " ", document_text)
document_text = document_text.strip()

print(document_text[:997])

AI in the
Enterprise
Lessons from seven frontier companies
Contents
A new way to work 3
Executive summary 5
Seven lessons for enterprise AI adoption
Start with evals 6
Embed AI into your products 9
Start now and invest early 11
Customize and fine-tune your models 13
Get AI in the hands of experts 16
Unblock your developers 18
Set bold automation goals 21
Conclusion 22
More resources 24
2 AI in the Enterprise
A new way
to work
As an AI research and deployment company, OpenAI prioritizes partnering with global companies
because our models will increasingly do their best work with sophisticated, complex,
interconnected workflows and systems.
We’re seeing AI deliver significant, measurable improvements on three fronts:
01 Workforce performance Helping people deliver higher-quality outputs in shorter
time frames.
02 Automating routine Freeing people from repetitive tasks so they can focus
operations on adding value.
03 Powering products By delivering more relevant and responsive customer


## Prompt

In [36]:
doc_title = "AI in the Enterprise"
source = "OpenAI"

system_prompt = f"""
You are acting as an expert assistant representing the contents of the report titled "{doc_title}" published by {source}.

Your role is to answer user questions about enterprise AI, based solely on this document. Be helpful, clear, and professional.

Only answer questions that are addressed in the report. If something isn’t covered, say so.

## Full Report:
{document_text}

With this context, please answer the user's questions as accurately as possible.
"""

def chat(message, history=None, temperature=0.3):
    if history is None:
        history = []

    messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=temperature
    )

    return response.choices[0].message.content.strip()

gr.ChatInterface(chat, type="messages").launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://8851cedb278d48e295.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
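
Depending on the Gradio version, the `history` list that `gr.ChatInterface(type="messages")` passes to `chat` may carry extra keys beyond `role` and `content` (for example `metadata`). The following is a minimal sketch under that assumption; `to_openai_messages` is a hypothetical helper, not part of the notebook, that keeps only the two fields a chat completions request needs.

```python
def to_openai_messages(history):
    """Keep only 'role' and 'content' from each chat turn.

    Hypothetical helper: assumes every turn is a dict that has at least
    the 'role' and 'content' keys.
    """
    return [{"role": turn["role"], "content": turn["content"]} for turn in history]


# Example turns with extra keys (the extra keys are an assumption for illustration)
history = [
    {"role": "user", "content": "What is an eval?", "metadata": None},
    {"role": "assistant", "content": "A structured test of model output.", "options": None},
]
clean_history = to_openai_messages(history)
print(clean_history)
```

Inside `chat`, the cleaned list could be used in place of `history` when the `messages` payload is assembled.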




## 📋 **Evaluator Agent Summary**

In [56]:
from pydantic import BaseModel, ValidationError
import textwrap

# ✅ Step 1: Evaluation Schema using Pydantic
class Evaluation(BaseModel):
    is_acceptable: bool
    feedback: str

# ✅ Step 2: Evaluation system prompt
company_name = "OpenAI"
doc_title = "AI in the Enterprise"

evaluator_system_prompt = f"""You are an evaluator that decides whether a response to a user’s question is acceptable.
You are provided with a conversation between a User and an Assistant Agent.
The Agent is representing the contents of the report titled "{doc_title}" published by {company_name},
and is answering questions based solely on the document.

Your task is to evaluate whether the Agent's latest response:
- Accurately reflects the content of the report
- Avoids speculation or unrelated information
- Communicates clearly and professionally
- Would be helpful to a business or technical decision-maker

If the Agent’s answer is accurate and relevant to the report, mark it acceptable.
If it is off-topic, unclear, overly speculative, or unhelpful, mark it unacceptable and explain why.

You have access to the full text of the document for context.
## Document Contents:
{document_text}

With this context, please evaluate the latest response.
"""

# ✅ Step 3: User prompt construction for evaluator
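def evaluator_user_prompt(reply, message, history):
    # Render prior turns as a readable transcript rather than a raw Python list
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history) or "(no prior turns)"
    user_prompt = f"Here's the conversation between the User and the Agent:\n\n{transcript}\n\n"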
    user_prompt += f"User's latest question:\n\n{message}\n\n"
    user_prompt += f"Agent's response:\n\n{reply}\n\n"
    user_prompt += (
        "Please evaluate the response. Return your answer in the following JSON format:\n"
        '{\n  "is_acceptable": true | false,\n  "feedback": "Brief explanation of your judgment"\n}'
    )
    return user_prompt

# ✅ Step 4: Evaluation function
def evaluate(reply, message, history) -> Evaluation:
    messages = [
        {"role": "system", "content": evaluator_system_prompt},
        {"role": "user", "content": evaluator_user_prompt(reply, message, history)}
    ]

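    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.0,
        # JSON mode keeps the reply parseable; the user prompt already mentions JSON, as this mode requires
        response_format={"type": "json_object"}
    )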

    try:
        parsed = Evaluation.model_validate_json(response.choices[0].message.content)
        return parsed
    except ValidationError as e:
        print("❌ Failed to validate Evaluation response:", e)
        return Evaluation(is_acceptable=False, feedback="Could not parse the evaluation response.")

# ✅ Step 5: Compose message history and user input
history = []
message = "Why is it important to start with evaluations when adopting AI in a company?"

# ✅ Step 6: Generate agent reply
messages = [{"role": "system", "content": system_prompt}] + history + [{"role": "user", "content": message}]
response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0.3)
reply = response.choices[0].message.content.strip()

# ✅ Step 7: Print Agent Reply neatly
print("\n💬 Agent Reply:\n")
print(textwrap.fill(reply, width=80))

# ✅ Step 8: Run Evaluation and print result
evaluation = evaluate(reply, message, history)

print("\n🧪 Evaluation Result:")
print("✅ Acceptable:" if evaluation.is_acceptable else "❌ Not acceptable.")
print("💬 Feedback:\n")
print(textwrap.fill(evaluation.feedback, width=80))



💬 Agent Reply:

Starting with evaluations is important when adopting AI in a company because it
provides a systematic process to measure how AI models perform against specific
use cases. This rigorous evaluation process helps ensure quality and safety by
continuously improving AI-enabled processes with expert feedback at every step.
For example, Morgan Stanley conducted intensive evaluations to determine the
effectiveness of AI applications, which ultimately led to increased efficiency
and better insights for their financial advisors. Evaluations help build
confidence in the AI solutions being implemented and ensure they meet the
necessary benchmarks for accuracy, relevance, and compliance.

🧪 Evaluation Result:
✅ Acceptable:
💬 Feedback:

The Agent's response accurately reflects the content of the report by explaining
the importance of starting with evaluations in AI adoption. It highlights the
systematic process for measuring AI model performance, the role of expert
feedback, and pro

This *is* a great coincidence, and a very meta one at that!

You're evaluating an agent **about evaluation**, based on a document whose first lesson is to **start with evals**. In effect, the agent is following the report's own advice in real time.

### 💡 This Moment Captures the Power of Agents:

* ✅ **Structured inputs**: Your agent references a trusted document.
* ✅ **Validated outputs**: Your evaluator enforces quality and alignment.
* ✅ **Human-aligned behavior**: Clear, relevant, actionable responses.
* ✅ **Business relevance**: Mirrors what enterprises need most — explainability, clarity, and grounded reasoning.

### 📝 You Might Even Add This to Your Notebook:

> *"In an ironic but fitting twist, our evaluator-approved agent highlighted the importance of starting with evaluations, a lesson taken directly from OpenAI’s own recommendations on enterprise AI strategy. A single passing evaluation is only a sanity check, but it is an encouraging one for the agent design."*




In [57]:
def chat(message, history):
    # Use your full document system prompt
    system = system_prompt

    # Step 1: Compose full prompt with system, history, and user message
    messages = [{"role": "system", "content": system}] + history + [{"role": "user", "content": message}]

    # Step 2: Get agent reply
    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = response.choices[0].message.content.strip()

    # Step 3: Evaluate reply quality
    evaluation = evaluate(reply, message, history)

    # Step 4: Retry if response is unacceptable
    if evaluation.is_acceptable:
        print("✅ Passed evaluation – returning reply")
    else:
        print("❌ Failed evaluation – retrying...")
        print("💬 Feedback:", evaluation.feedback)
        reply = rerun(reply, message, history, evaluation.feedback)

    return reply

def rerun(previous_reply, message, history, feedback):
    correction_prompt = f"The last reply was not acceptable because: {feedback}. Please revise and improve it."
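    # Include the original question so the model sees what the rejected reply was answering
    updated_history = history + [
        {"role": "user", "content": message},
        {"role": "assistant", "content": previous_reply},
    ]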

    messages = [{"role": "system", "content": system_prompt}] + updated_history + [{"role": "user", "content": correction_prompt}]

    response = openai.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content.strip()

history = []
message = "Why is it important to start with evaluations when adopting AI in a company?"
response = chat(message, history)
print("\n📥 Final Agent Reply:\n" + "-"*60)
print(textwrap.fill(response, width=70))



✅ Passed evaluation – returning reply

📥 Final Agent Reply:
------------------------------------------------------------
Starting with evaluations is crucial when adopting AI in a company
because it establishes a systematic process to measure how AI models
perform against specific use cases. Evaluations, or "evals," help
ensure quality and safety by providing rigorous, structured
assessments of model outputs against benchmarks. This process not only
aids in validating and testing the AI but also facilitates continuous
improvement of AI-enabled processes with expert feedback.  By
conducting evaluations, companies can build confidence in their AI
applications, ensuring they deliver accurate, relevant, and compliant
results. This foundation allows organizations to roll out use cases
into production effectively, thereby optimizing workforce performance
and delivering greater value from AI initiatives.



### ✅ **Final Agent Architecture Overview**

#### `chat(message, history)`

* **Inputs**: A user question and chat history.
* **System prompt**: Sets the document-grounded context (OpenAI's enterprise report).
* **Step 1**: Composes the full message payload using current message and history.
* **Step 2**: Gets a response from the model.
* **Step 3**: Evaluates the response using a Pydantic-based evaluator.
* **Step 4**: If the response is unacceptable, it routes to `rerun()` for refinement.
* **Returns**: The first reply if it passes evaluation; otherwise a single revised reply from `rerun()`, which is not evaluated again.
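
The steps above retry once and return the revision without checking it. As a hedged sketch of one possible extension, the loop below re-evaluates every revision and stops after a fixed number of retries. The function `answer_with_retries`, its `max_retries` parameter, and the stub callables are all hypothetical names introduced for illustration; the stubs stand in for the model calls so the control flow can run offline.

```python
def answer_with_retries(generate, evaluate, revise, max_retries=2):
    """Return (reply, passed) after at most `max_retries` revisions.

    generate() -> str                 produces the first reply
    evaluate(reply) -> (bool, str)    returns (is_acceptable, feedback)
    revise(reply, feedback) -> str    produces an improved reply
    """
    reply = generate()
    for attempt in range(max_retries + 1):
        ok, feedback = evaluate(reply)
        if ok:
            return reply, True
        if attempt < max_retries:
            reply = revise(reply, feedback)
    # Retries exhausted: return the last reply and flag that it did not pass
    return reply, False


# Stubs standing in for the OpenAI calls (illustrative only)
def fake_generate():
    return "draft"


def fake_evaluate(reply):
    # Accept a reply once it has been revised at least one time
    return ("+" in reply, "needs more detail")


def fake_revise(reply, feedback):
    return reply + "+"


result = answer_with_retries(fake_generate, fake_evaluate, fake_revise)
print(result)
```

Returning the `passed` flag alongside the reply lets the caller decide what to do when no revision passes, for example showing a fallback message.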

#### `evaluate(...)`

* Uses a second OpenAI call to determine if the agent's response:

  * Matches the source material
  * Communicates professionally
  * Avoids hallucination or speculation
* Returns a structured `Evaluation` object (Pydantic)
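
Models sometimes wrap JSON in a Markdown code fence, which strict validation rejects. The sketch below shows one way to tolerate that before parsing. It is not the notebook's implementation: it uses a plain dataclass (`EvalResult`, a hypothetical stand-in for the Pydantic `Evaluation` model) so that it runs with the standard library alone, and `parse_evaluation` is likewise an illustrative name.

```python
import json
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence


@dataclass
class EvalResult:
    # Stand-in for the notebook's Pydantic `Evaluation` model
    is_acceptable: bool
    feedback: str


def parse_evaluation(raw: str) -> EvalResult:
    """Parse evaluator output, failing closed on anything malformed."""
    # Strip an optional leading fence (with or without "json") and a trailing fence
    text = re.sub(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", "", raw.strip())
    try:
        data = json.loads(text)
        if not isinstance(data["is_acceptable"], bool) or not isinstance(data["feedback"], str):
            raise TypeError("unexpected field types")
        return EvalResult(data["is_acceptable"], data["feedback"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Same fallback idea as the notebook: treat unparseable output as a failure
        return EvalResult(False, "Could not parse the evaluation response.")


fenced = FENCE + 'json\n{"is_acceptable": true, "feedback": "Grounded in the report."}\n' + FENCE
parsed = parse_evaluation(fenced)
print(parsed)
```

Failing closed means a malformed evaluator reply triggers a retry rather than silently approving the answer.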

#### `rerun(...)`

* Injects evaluation feedback into a follow-up user message (correction prompt)
* Adds the previous bad response to history
* Re-queries the model with explicit instruction to improve its answer
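
The payload for such a revision request can be sketched as a pure function, which makes the message order easy to inspect without calling the API. `build_retry_messages` is a hypothetical helper; it makes one assumption beyond the bullets above, namely that the original question is placed before the rejected reply so the model can see what it was answering.

```python
def build_retry_messages(system_prompt, history, message, previous_reply, feedback):
    """Assemble the chat payload for a revision request."""
    correction = (
        f"The last reply was not acceptable because: {feedback}. "
        "Please revise and improve it."
    )
    return (
        [{"role": "system", "content": system_prompt}]
        + list(history)
        + [
            {"role": "user", "content": message},              # original question
            {"role": "assistant", "content": previous_reply},  # rejected reply
            {"role": "user", "content": correction},           # evaluator feedback
        ]
    )


retry_messages = build_retry_messages(
    system_prompt="Answer only from the report.",
    history=[],
    message="Why start with evals?",
    previous_reply="Because they are popular.",
    feedback="Not grounded in the report",
)
print([m["role"] for m in retry_messages])
```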

---

### 🧠 Why This Is Powerful

* You’ve built a **closed-loop QA agent**:

  * Answer grounded in a document ✅
  * Evaluated by a second layer ✅
  * Automatically retried if necessary ✅
* This mirrors **real-world enterprise AI safety** pipelines, where:

  * LLMs must pass quality gates
  * Output must reflect source truth
  * Risk of hallucination must be minimized

---

This single-document QA agent has powerful applications in **business scenarios** where people need accurate, conversational access to specific documents. Here are several **real-world use cases**:

---

### 🧾 1. **Policy & Compliance Assistants**

**Use case:** Help employees or partners understand lengthy legal or compliance documents.

* **Example:** “What are the guidelines for data retention under our policy?”
* **Source doc:** Company data privacy policy, employee handbook, or compliance manual
* **Benefit:** Reduces legal exposure and improves policy comprehension

---

### 📄 2. **Product Manual or Technical Guide Q&A**

**Use case:** Allow customers or support reps to ask questions about complex products.

* **Example:** “How do I reset the device to factory settings?”
* **Source doc:** Product manual or installation guide
* **Benefit:** Reduces support load, improves customer experience

---

### 💼 3. **Internal Knowledge Assistants**

**Use case:** Give employees conversational access to strategy docs, training materials, etc.

* **Example:** “What are our goals for Q3 according to the leadership playbook?”
* **Source doc:** Strategy memo, training deck, OKR document
* **Benefit:** Faster onboarding, better alignment

---

### 📊 4. **Research Report Explainers**

**Use case:** Let stakeholders ask questions about a market research report or whitepaper.

* **Example:** “What trends are mentioned in the Asia-Pacific region?”
* **Source doc:** Industry report, investment brief, analyst whitepaper
* **Benefit:** Increases the utility and reach of high-value research

---

### 📃 5. **RFP / Proposal Q&A Agents**

**Use case:** Help teams prepare or review large RFP responses.

* **Example:** “Do we meet the requirement for cybersecurity certifications?”
* **Source doc:** 100-page RFP response PDF
* **Benefit:** Saves hours of review time and reduces errors

---

### 🏛️ 6. **Public Sector Transparency**

**Use case:** Citizens ask questions about legislation, budgets, or government reports.

* **Example:** “What is the allocated budget for renewable energy in this bill?”
* **Source doc:** City or national legislation PDF
* **Benefit:** Promotes accountability and citizen engagement

---

### ✍️ Bonus: Combine with Feedback Loop

You could combine this setup with the following to turn it into a full **information agent pipeline**:

* ✅ Evaluation agent (already done)
* 📨 Email summaries to stakeholders
* 📊 Visual dashboards


