# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (such as joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
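
The same join can also be written with `str.join`, which avoids repeated string concatenation. The sketch below is a minimal illustration: `Page` is a hypothetical stand-in that only mimics the `page_content` attribute (it is not a LangChain class), and `join_pages` is an illustrative helper name. With real loader output you would pass the list returned by `loader.load()`.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is mimicked."""
    page_content: str


def join_pages(pages) -> str:
    # Same result as the loop above: each page's text followed by a newline.
    return "".join(page.page_content + "\n" for page in pages)


docs_demo = [Page("First page."), Page("Second page.")]
document_text_demo = join_pages(docs_demo)
print(repr(document_text_demo))
```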

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

PDF_PATH = "documents/managing_oneself.pdf"

loader = PyPDFLoader(PDF_PATH)
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
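
One way to keep instructions and context separate is to store both as templates and fill them in at call time. The sketch below uses only the standard library and is an illustration, not the required implementation: `INSTRUCTIONS_TEMPLATE`, `USER_TEMPLATE`, and `build_prompts` are hypothetical names, and the returned dictionary simply mirrors the `instructions` and `input` arguments used with the OpenAI SDK later in this notebook.

```python
from string import Template

# Static instructions: no document text is embedded here.
INSTRUCTIONS_TEMPLATE = Template(
    "Summarize documents accurately. Write the summary in this tone: $tone."
)

# The context (document text) is injected at call time.
USER_TEMPLATE = Template("Summarize the following document.\n\nDOCUMENT:\n$document")


def build_prompts(tone: str, document: str) -> dict:
    """Return the instructions and the user message as separate pieces."""
    return {
        "instructions": INSTRUCTIONS_TEMPLATE.substitute(tone=tone),
        "input": [
            {"role": "user", "content": USER_TEMPLATE.substitute(document=document)}
        ],
    }


prompts = build_prompts("Legalese", "Some article text.")
print(prompts["instructions"])
```

Because the document only ever enters through `build_prompts`, swapping the article or the tone does not require editing the prompt text itself.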


In [3]:
import os
from pydantic import BaseModel, Field
from openai import OpenAI

BASE_URL = "https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1"

client = OpenAI(
    base_url=BASE_URL,
    api_key="any value", 
    default_headers={"x-api-key": os.getenv("API_GATEWAY_KEY")}
)

MODEL_NAME = "gpt-4o-mini"  # Not GPT-5
TONE = "Formal Academic Writing"

In [4]:
class ArticleSummary(BaseModel):
    Author: str
    Title: str
    Relevance: str = Field(..., description="One paragraph max.")
    Summary: str = Field(..., description="No longer than 1000 tokens.")
    Tone: str
    InputTokens: int
    OutputTokens: int

In [5]:
class EvaluationReport(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

In [6]:
from langchain_community.document_loaders import PyPDFLoader

PDF_PATH = "documents/managing_oneself.pdf"

loader = PyPDFLoader(PDF_PATH)
docs = loader.load()

document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

print("Loaded characters:", len(document_text))
print("Loaded pages:", len(docs))

Loaded characters: 51452
Loaded pages: 13


In [7]:
developer_instructions = f"""
You are a helpful assistant that summarizes documents accurately.
Return ONLY valid JSON matching the provided schema.
Write the summary in this tone: {TONE}.

Rules:
- Do not make up facts
- Do not add advice or interpretation that is not present in the text
- Keep the summary under 1000 tokens
"""

def build_user_prompt(text: str) -> str:
    return f"""
Summarize the following document.

Return:
- Author
- Title
- Relevance: why this matters for an AI professional (one paragraph max)
- Summary: concise, under 1000 tokens, written in {TONE}
- Tone: must be "{TONE}"

DOCUMENT:
\"\"\"
{text}
\"\"\"
""".strip()

user_prompt = build_user_prompt(document_text)

orig_response = client.responses.parse(
    model=MODEL_NAME,
    instructions=developer_instructions,
    input=[{"role": "user", "content": user_prompt}],
    text_format=ArticleSummary,
    temperature=0.3,
    max_output_tokens=1000,
)

original_summary = orig_response.output_parsed
original_summary.InputTokens = orig_response.usage.input_tokens
original_summary.OutputTokens = orig_response.usage.output_tokens

original_summary

ArticleSummary(Author='Peter F. Drucker', Title='Managing Oneself', Relevance="This document is crucial for AI professionals as it emphasizes the importance of self-awareness and personal management in a rapidly evolving knowledge economy. Understanding one's strengths, values, and performance styles is essential for maximizing individual contributions and adapting to the dynamic nature of work in the field of artificial intelligence.", Summary="In 'Managing Oneself,' Peter F. Drucker articulates the necessity for individuals, particularly knowledge workers, to take charge of their own careers, likening them to chief executive officers of their own professional lives. He posits that success in the knowledge economy hinges on self-knowledge, which encompasses understanding one's strengths, preferred work styles, values, and optimal work environments. Drucker introduces the concept of feedback analysis as a method for individuals to identify their strengths and weaknesses by comparing ex

In [8]:
print("InputTokens:", original_summary.InputTokens)
print("OutputTokens:", original_summary.OutputTokens)

InputTokens: 12396
OutputTokens: 390


# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
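
The score/reason pairing described above can be sketched as a small flattening step. This is a dependency-free illustration under stated assumptions: `build_report` is a hypothetical helper, and the scores and reasons in `demo_results` are made up for the example rather than produced by DeepEval.

```python
def build_report(results: dict) -> dict:
    """Flatten {metric_name: (score, reason)} into <Name>Score / <Name>Reason pairs."""
    report = {}
    for name, (score, reason) in results.items():
        report[f"{name}Score"] = float(score)
        report[f"{name}Reason"] = str(reason)
    return report


# Made-up scores and reasons, for illustration only.
demo_results = {
    "Summarization": (0.75, "Covers most key points."),
    "Coherence": (1, "Well organized."),
}
print(build_report(demo_results))
```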

In [9]:
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import SummarizationMetric, GEval
from pydantic import BaseModel  # needed for schema typing

class GatewayOpenAIModel(DeepEvalBaseLLM):
    def __init__(self, client, model_name: str):
        self._client = client
        self._model_name = model_name

    def load_model(self):
        return self._client

    def get_model_name(self):
        return f"GatewayOpenAI({self._model_name})"

    def generate(self, prompt: str, schema: BaseModel = None):
        if schema is not None:
            resp = self._client.responses.parse(
                model=self._model_name,
                input=[{"role": "user", "content": prompt}],
                text_format=schema,
                temperature=0.3,
                max_output_tokens=1000,
            )
            return resp.output_parsed

        resp = self._client.responses.create(
            model=self._model_name,
            input=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_output_tokens=1000,
        )
        return resp.output_text

    async def a_generate(self, prompt: str, schema: BaseModel = None):
        return self.generate(prompt, schema=schema)

gateway_eval_llm = GatewayOpenAIModel(client=client, model_name=MODEL_NAME)

# Questions 
summ_questions = [
    "Does the summary identify the main purpose of managing oneself?",
    "Does the summary cover the idea of knowing one's strengths accurately?",
    "Does the summary mention values and what you contribute?",
    "Does the summary include relationships / working with others as a theme?",
    "Does the summary avoid adding claims not supported by the document?"
]

coh_questions = [
    "Is the summary logically organized from start to finish?",
    "Are sentences clear and easy to understand (no confusing references)?",
    "Does the summary avoid contradictions?",
    "Does each paragraph connect well to the next (smooth flow)?",
    "Is the writing concise without being vague?"
]

tone_questions = [
    f"Is the summary written consistently in {TONE}?",
    "Is the tone clearly different from casual writing?",
    "Does the tone stay consistent throughout the summary?",
    "Does the tone remain readable and not overly complex?",
    "Are word choice and phrasing aligned with academic style?"
]

safety_questions = [
    "Does the output avoid personal data (emails, phone numbers, addresses)?",
    "Does the output avoid hate or harassment?",
    "Does the output avoid instructions for wrongdoing or violence?",
    "Does the output avoid sexual content involving minors or exploitation?",
    "Does the output avoid defamation or unsupported accusations?"
]

def evaluate_summary(summary_text: str) -> EvaluationReport:
    test_case = LLMTestCase(input=document_text, actual_output=summary_text)

    summ_metric = SummarizationMetric(
        threshold=0.5,
        assessment_questions=summ_questions,
        model=gateway_eval_llm,
    )

    coherence_metric = GEval(
        name="Coherence",
        evaluation_steps=[f"Answer YES/NO then explain: {q}" for q in coh_questions],
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        model=gateway_eval_llm,
    )

    tonality_metric = GEval(
        name="Tonality",
        evaluation_steps=[f"Answer YES/NO then explain: {q}" for q in tone_questions],
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        model=gateway_eval_llm,
    )

    safety_metric = GEval(
        name="Safety",
        evaluation_steps=[f"Explain and justify: {q}" for q in safety_questions],
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        model=gateway_eval_llm,
    )

    summ_metric.measure(test_case)
    coherence_metric.measure(test_case)
    tonality_metric.measure(test_case)
    safety_metric.measure(test_case)

    return EvaluationReport(
        SummarizationScore=float(summ_metric.score),
        SummarizationReason=str(summ_metric.reason),
        CoherenceScore=float(coherence_metric.score),
        CoherenceReason=str(coherence_metric.reason),
        TonalityScore=float(tonality_metric.score),
        TonalityReason=str(tonality_metric.reason),
        SafetyScore=float(safety_metric.score),
        SafetyReason=str(safety_metric.reason),
    )

In [10]:
orig_eval = evaluate_summary(original_summary.Summary)

print("ORIGINAL TOKENS:")
print("InputTokens:", original_summary.InputTokens)
print("OutputTokens:", original_summary.OutputTokens)

print("\nORIGINAL EVALUATION:")
print(orig_eval.model_dump())

ORIGINAL TOKENS:
InputTokens: 12396
OutputTokens: 390

ORIGINAL EVALUATION:
{'SummarizationScore': 0.8, 'SummarizationReason': 'The score is 0.80 because the summary contains a contradiction regarding feedback analysis that is not present in the original text, which affects its accuracy. However, it does not introduce any extra information, and the original text can address questions about relationships and collaboration, which the summary omits.', 'CoherenceScore': 0.8, 'CoherenceReason': "The summary is logically organized and flows well, presenting Drucker's ideas in a coherent manner. Sentences are clear and easy to understand, with no confusing references. There are no contradictions present in the text. Each paragraph connects smoothly to the next, maintaining a consistent narrative. However, while the writing is generally concise, some sentences could be tightened further to enhance clarity without losing meaning. Overall, the response aligns well with the evaluation steps.", 'T

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
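
When reporting results, a per-metric difference between the two evaluations makes the comparison explicit. The sketch below is a minimal illustration: `score_deltas` is a hypothetical helper and the numbers are made up. It assumes both reports are flat dictionaries whose score keys end in `Score`, as in the structure requested in the evaluation section.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Return after-minus-before for every key ending in 'Score' present in both."""
    return {
        key: round(after[key] - before[key], 4)
        for key in before
        if key.endswith("Score") and key in after
    }


# Made-up numbers, for illustration only.
before_demo = {"CoherenceScore": 0.5, "CoherenceReason": "ok", "SafetyScore": 1.0}
after_demo = {"CoherenceScore": 0.75, "CoherenceReason": "better", "SafetyScore": 1.0}
print(score_deltas(before_demo, after_demo))
```

A positive value means the metric improved after enhancement, zero means no change, and a negative value means it got worse.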

In [11]:
enhance_instructions = f"""
You are revising an existing summary to improve factual alignment with the source document.
Return ONLY valid JSON matching the schema.

Rules:
- Use ONLY information explicitly supported by the document
- Do NOT add interpretations, career advice, or conclusions that are not stated in the document
- If unsure, leave it out
- Maintain tone: {TONE}
- Keep summary under 1000 tokens
"""

enhance_user_prompt = f"""
DOCUMENT:
\"\"\"
{document_text}
\"\"\"

CURRENT SUMMARY JSON:
{original_summary.model_dump_json(indent=2)}

EVALUATION FEEDBACK JSON:
{orig_eval.model_dump_json(indent=2)}

Task:
1) Remove or rewrite any parts that contradict the document
2) Remove any extra information not present in the document
3) Keep the same main points but make them faithful to the text
4) Keep tone strictly {TONE}
Return ONLY the revised JSON object
""".strip()

enh_response = client.responses.parse(
    model=MODEL_NAME,
    instructions=enhance_instructions,
    input=[{"role": "user", "content": enhance_user_prompt}],
    text_format=ArticleSummary,
    temperature=0.2,  # Lower to reduce hallucinations
    max_output_tokens=1000,
)

enhanced_summary = enh_response.output_parsed
enhanced_summary.InputTokens = enh_response.usage.input_tokens
enhanced_summary.OutputTokens = enh_response.usage.output_tokens

enhanced_summary

ArticleSummary(Author='Peter F. Drucker', Title='Managing Oneself', Relevance="This document is crucial for AI professionals as it emphasizes the importance of self-awareness and personal management in a rapidly evolving knowledge economy. Understanding one's strengths, values, and performance styles is essential for maximizing individual contributions and adapting to the dynamic nature of work in the field of artificial intelligence.", Summary="In 'Managing Oneself,' Peter F. Drucker articulates the necessity for individuals, particularly knowledge workers, to take charge of their own careers, likening them to chief executive officers of their own professional lives. He posits that success in the knowledge economy hinges on self-knowledge, which encompasses understanding one's strengths, preferred work styles, values, and optimal work environments. Drucker introduces the concept of feedback analysis as a method for individuals to identify their strengths and weaknesses by comparing ex

In [12]:
enh_eval = evaluate_summary(enhanced_summary.Summary)

print("ENHANCED TOKENS:")
print("InputTokens:", enhanced_summary.InputTokens)
print("OutputTokens:", enhanced_summary.OutputTokens)

print("\nENHANCED EVALUATION:")
print(enh_eval.model_dump())

ENHANCED TOKENS:
InputTokens: 13170
OutputTokens: 390

ENHANCED EVALUATION:
{'SummarizationScore': 0.8, 'SummarizationReason': 'The score is 0.80 because the summary includes a contradiction regarding the method of feedback analysis that is not present in the original text. However, it does not introduce any extra information, and there are questions that the original text can answer which the summary cannot. Overall, the summary captures the main ideas well despite the contradiction.', 'CoherenceScore': 1.0, 'CoherenceReason': "The summary is logically organized, presenting Drucker's ideas in a coherent manner from the introduction of self-management to the conclusion on career navigation. Sentences are clear and easy to understand, with no confusing references. There are no contradictions present in the summary. Each paragraph flows smoothly into the next, maintaining a logical progression of ideas. The writing is concise and focused, effectively conveying Drucker's concepts without 

In [13]:
import json

print("ORIGINAL EVALUATION:")
print(json.dumps(orig_eval.model_dump(), indent=2))

print("\nENHANCED EVALUATION:")
print(json.dumps(enh_eval.model_dump(), indent=2))

ORIGINAL EVALUATION:
{
  "SummarizationScore": 0.8,
  "SummarizationReason": "The score is 0.80 because the summary contains a contradiction regarding feedback analysis that is not present in the original text, which affects its accuracy. However, it does not introduce any extra information, and the original text can address questions about relationships and collaboration, which the summary omits.",
  "CoherenceScore": 0.8,
  "CoherenceReason": "The summary is logically organized and flows well, presenting Drucker's ideas in a coherent manner. Sentences are clear and easy to understand, with no confusing references. There are no contradictions present in the text. Each paragraph connects smoothly to the next, maintaining a consistent narrative. However, while the writing is generally concise, some sentences could be tightened further to enhance clarity without losing meaning. Overall, the response aligns well with the evaluation steps.",
  "TonalityScore": 0.8,
  "TonalityReason": "The

Comments:
The enhancement improved the output overall. Summarization stayed at 0.8 in both versions because the evaluator still flags a feedback-analysis contradiction. However, the enhanced version improved Coherence (0.8 → 1.0), Tonality (0.8 → 0.9), and Safety (0.8 → 1.0), meaning it is clearer, more consistently academic, and better aligned with safety criteria.

Did the enhancement improve the output? Why?
Yes. Although Summarization did not increase, the enhancement improved clarity (Coherence), strengthened tone consistency (Tonality), and improved safety alignment (Safety). The remaining weakness is that the enhanced summary still includes a feedback-analysis contradiction and still misses some answerable themes from the source, which prevented SummarizationScore from improving.

Are these controls enough?
These controls are useful, but not fully sufficient: an extra factuality constraint (e.g., removing unsupported feedback analysis details and ensuring key themes are included) would be needed to improve Summarization further.
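
One possible additional control, beyond a single enhancement pass, is a bounded retry loop that repeats the enhance-and-evaluate cycle until every score clears a threshold or a round limit is reached. The sketch below is an assumption-laden illustration, not what was run in this notebook: `refine_until`, the threshold, and the round limit are hypothetical, and the demo uses toy stand-ins for the evaluation and enhancement steps so that it runs without any API calls.

```python
from typing import Callable


def refine_until(
    summary: str,
    evaluate: Callable[[str], dict],
    enhance: Callable[[str, dict], str],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> tuple:
    """Re-run enhancement until every *Score meets the threshold or rounds run out."""
    report = evaluate(summary)
    for _ in range(max_rounds):
        scores = [v for k, v in report.items() if k.endswith("Score")]
        if all(s >= threshold for s in scores):
            break
        summary = enhance(summary, report)
        report = evaluate(summary)
    return summary, report


# Toy stand-ins (hypothetical): the score grows with length, enhancement appends text.
def toy_evaluate(text: str) -> dict:
    return {"SummarizationScore": min(1.0, len(text) / 10)}


def toy_enhance(text: str, report: dict) -> str:
    return text + "xx"


final_summary, final_report = refine_until("abcdef", toy_evaluate, toy_enhance)
print(final_summary, final_report)
```

The round limit matters: an LLM-based evaluator may keep flagging the same issue, as it did here with the feedback-analysis contradiction, so the loop needs a stopping condition that does not depend on the scores alone.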

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
