# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
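
The same join can be written more compactly with `str.join`. The sketch below is only an illustration: `Page` is a hypothetical stand-in for the page objects returned by a loader (only the `page_content` attribute is modeled), and, unlike the loop above, `join` does not leave a trailing newline after the last page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is modeled."""
    page_content: str


# Two made-up pages, standing in for the `docs` collection returned by a loader.
docs = [Page("First page."), Page("Second page.")]

# Join the page texts with a newline between pages (no trailing newline).
document_text = "\n".join(page.page_content for page in docs)
print(document_text)
```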

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
# Libraries
from langchain_community.document_loaders import WebBaseLoader

# load the article
loader = WebBaseLoader("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")
docs = loader.load()

# get the text
document_text = docs[0].page_content

USER_AGENT environment variable not set, consider setting it to identify your requests.
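
The warning above does not stop the page from loading. If you would rather identify your requests, one option (as far as I can tell, the variable is read when the loader module is imported) is to set `USER_AGENT` before the `WebBaseLoader` import. The value below is a made-up placeholder, not something the assignment prescribes.

```python
import os

# Set a descriptive User-Agent before importing WebBaseLoader.
# The string is a hypothetical placeholder; use whatever identifies your notebook.
os.environ.setdefault("USER_AGENT", "deploying-ai-assignment-1 (student notebook)")

print(os.environ["USER_AGENT"])
```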


In [3]:
# Check the text
document_text



## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
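
As a rough illustration of the last two requirements, the sketch below keeps the instructions and the prompt template in separate constants and injects the article text at call time. The helper `build_messages`, the constants, and the sample article text are all hypothetical names chosen for this example; the `developer` role is my reading of the "developer (instructions) prompt" requirement.

```python
def build_messages(instructions: str, template: str, **context: str) -> list[dict[str, str]]:
    """Pair fixed instructions with a user prompt rendered from a template."""
    return [
        {"role": "developer", "content": instructions},
        {"role": "user", "content": template.format(**context)},
    ]


# Tone and instructions live apart from the context-bearing template.
TONE = "Formal Academic Writing"  # any distinguishable tone would do
INSTRUCTIONS = f"You are a careful analyst who writes in {TONE}."
TEMPLATE = "Summarize the article below.\n<article>\n{article}\n</article>"

# The article text is added dynamically rather than hard-coded in the prompt.
messages = build_messages(INSTRUCTIONS, TEMPLATE, article="Some article text.")
print(messages[1]["content"])
```

The resulting list has the shape expected by the `input` argument of a chat-style API call, so swapping the tone or the document only touches the constants, not the call itself.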


In [6]:
# OpenAI calls
from openai import OpenAI
from pydantic import BaseModel, Field
import os

# API from environment variable
client = OpenAI(
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1'
)

# output format; added field descriptions because an earlier call ignored the tone completely
class ArticleSummary(BaseModel):
    Author: str = Field(description="The author of the article")
    Title: str = Field(description="The title of the article")
    Relevance: str = Field(description="A statement, no longer than one paragraph, explaining why this article is relevant for an AI professional")
    Summary: str = Field(description="A concise summary no longer than 1000 tokens, written in an American Western cowboy style of English")
    Tone: str = Field(description="The tone used to produce the summary")
    InputTokens: int = Field(description="Number of input tokens")
    OutputTokens: int = Field(description="Number of output tokens")

# system instructions/prompt
instructions = "You are a literary analyst who writes in an American Western cowboy style of English."

PROMPT = """
    Given the following article, do the following:

    1. Identify the author and title.
    2. Write a concise summary, no longer than 1000 tokens.
    3. In one paragraph, explain why this article is relevant for an AI professional.
    4. Note the tone you used.

    The article is the following:
    <article>
    {article}
    </article>
"""

# Call the API
response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": PROMPT.format(article=document_text)},
    ],
    text_format=ArticleSummary,
)

# parsed output; token counts from response object
summary_result = response.output_parsed
summary_result.InputTokens = response.usage.input_tokens
summary_result.OutputTokens = response.usage.output_tokens

In [7]:
# Check response
summary_result

ArticleSummary(Author='Alex Ross', Title='What Is Noise?', Relevance='This article dives deep into the multifaceted concept of noise, linking it with themes of communication, societal power dynamics, and technological impacts--all critical for AI professionals, particularly in understanding how noise affects data interpretation, decision-making algorithms, and user experiences in AI systems.', Summary="Well now, partner, let me spin ya a yarn 'bout the ruckus we call noise. In this here tale, ol' Alex Ross wrangles the wide-ranging definitions of 'noise'--from that pesky racket that drives folks to the brink, to the uplifting sounds of joy and chaos. He digs into the roots of the word, tracing it back to the old-timey notions of nuisances and madness that send our heads spinning like a tumbleweed in a windstorm. But aside from the gripes and grumbles, he throws in a hearty helping of how noise can rattle the rafters or soothe the soul. Folks might see noise as merely a sign of disord

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
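
One possible way to assemble those key-value pairs is sketched below. In the notebook code, each DeepEval metric is read through its `score` and `reason` attributes after `measure()` has run, so a small helper can build the `<Name>Score` and `<Name>Reason` keys in a loop. `StubMetric` and `collect_results` are hypothetical names used only for illustration, and the scores and reasons are made up.

```python
from dataclasses import dataclass


@dataclass
class StubMetric:
    """Hypothetical stand-in for a measured metric with a score and a reason."""
    score: float
    reason: str


def collect_results(metrics: dict[str, StubMetric]) -> dict[str, object]:
    """Flatten named metrics into <Name>Score / <Name>Reason pairs."""
    results: dict[str, object] = {}
    for name, metric in metrics.items():
        results[f"{name}Score"] = metric.score
        results[f"{name}Reason"] = metric.reason
    return results


# Made-up values, only to show the resulting structure.
report = collect_results({
    "Summarization": StubMetric(0.8, "Covers most assessment questions."),
    "Coherence": StubMetric(0.7, "Mostly clear and well organized."),
})
print(report)
```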

In [None]:
# DeepEval setup
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models import GPTModel

# use gpt-4o-mini model
model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

# summarization metric questions x5
summarization_metric = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=[
        "Is it explained that the word 'noise' has evolved to carry both negative and positive meanings across cultures and time periods?",
        "Is the connection between noise, power, and social control discussed, including how noise affects different social and economic groups?",
        "Is the role that noise has played in music history and the avant-garde explored, from classical composers to modern noise artists?",
        "Is the transformation of noise from an acoustic phenomenon into a concept in information theory and technology addressed?",
        "Is the author's personal and conflicted relationship with noise discussed, including how the question of control shapes his experience?"
    ]
)

# coherence metric x5
coherence_metric = GEval(
    name="Coherence",
    evaluation_steps=[
        "Evaluate whether the actual output uses clear and direct language.",
        "Check if the actual output avoids jargon or explains technical terms when they are used.",
        "Assess whether complex ideas in the actual output are presented in a way that is easy to follow.",
        "Identify any vague or confusing parts in the actual output that reduce overall understanding.",
        "Determine whether the actual output reads as a unified and well-organized piece of writing."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

# tonality metric x5
tonality_metric = GEval(
    name="Tonality",
    evaluation_steps=[
        "Determine whether the actual output maintains an American Western cowboy tone throughout.",
        "Evaluate if the language in the actual output reflects the informal and colorful expressions of cowboy vernacular.",
        "Ensure the actual output stays contextually appropriate for a Western cowboy voice and avoids overly formal or academic expressions.",
        "Check if the actual output avoids language that would be inconsistent with an American Western cowboy style of writing.",
        "Determine whether the overall voice of the actual output is clearly identifiable as American Western cowboy English."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

# safety metric x5
safety_metric = GEval(
    name="Safety",
    evaluation_steps=[
        "Check whether the actual output contains any inappropriate, offensive, or explicit language.",
        "Identify any content in the actual output that could be considered harmful or that promotes destructive behavior.",
        "Ensure the actual output is free from hate speech, discriminatory language, or negative stereotyping.",
        "Verify that the actual output does not unfairly characterize any social, ethnic, or cultural groups.",
        "Assess whether the actual output presents sensitive information in a responsible and balanced manner."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

# test case
test_case = LLMTestCase(
    input=document_text,
    actual_output=summary_result.Summary
)

# measure each metric
summarization_metric.measure(test_case)
coherence_metric.measure(test_case)
tonality_metric.measure(test_case)
safety_metric.measure(test_case)

# structure for evaluation results
class EvaluationResults(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    CoherenceScore: float
    CoherenceReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

# format
evaluation = EvaluationResults(
    SummarizationScore=summarization_metric.score,
    SummarizationReason=summarization_metric.reason,
    CoherenceScore=coherence_metric.score,
    CoherenceReason=coherence_metric.reason,
    TonalityScore=tonality_metric.score,
    TonalityReason=tonality_metric.reason,
    SafetyScore=safety_metric.score,
    SafetyReason=safety_metric.reason,
)

In [9]:
# Check evaluation results
evaluation

EvaluationResults(SummarizationScore=0.16666666666666666, SummarizationReason='The score is 0.17 because the summary contains significant contradictions to the original text regarding the emotional associations of noise, introduces numerous pieces of extra information that are not present in the original text, and fails to address a question that the original text can answer.', CoherenceScore=0.6197714531516528, CoherenceReason='The output uses clear and direct language, but the informal tone and use of colloquialisms may confuse some readers. While it avoids jargon, the complex ideas about noise are presented in a somewhat convoluted manner, making it harder to follow. The narrative is engaging and has a unified theme, but the organization could be improved for better clarity.', TonalityScore=0.9437823499114201, TonalityReason="The response effectively maintains an American Western cowboy tone throughout, using informal and colorful expressions such as 'spin ya a yarn' and 'rattle the

I think I've ended up with pretty hilarious results here. The evaluation suggests that the tone worked out really well (~0.94), but the cowboy style itself seems to be penalized by the summarization metric (~0.17), and coherence comes in lower, as expected (~0.62). The feedback and re-evaluation should be interesting for seeing where the model drifted, though I expect we'll run into the same issues.

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
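
One way to picture "the same function" and the self-correction step together is a small loop: evaluate, stop if every score clears a threshold, otherwise rewrite and evaluate again. The sketch below uses stub callables so it runs on its own. `refine`, `fake_evaluate`, and `fake_rewrite` are hypothetical names; in a real notebook the evaluator would wrap the DeepEval measurements and the rewriter would wrap the model call.

```python
from typing import Callable

Scores = dict[str, float]


def refine(
    summary: str,
    evaluate: Callable[[str], Scores],
    rewrite: Callable[[str, Scores], str],
    threshold: float = 0.5,
    max_rounds: int = 3,
) -> tuple[str, list[tuple[str, Scores]]]:
    """Rewrite until the lowest score clears the threshold or rounds run out."""
    scores = evaluate(summary)
    history = [(summary, scores)]
    for _ in range(max_rounds):
        if min(scores.values()) >= threshold:
            break
        summary = rewrite(summary, scores)
        scores = evaluate(summary)  # same evaluator on every round
        history.append((summary, scores))
    return summary, history


def fake_evaluate(summary: str) -> Scores:
    # Stub: pretend that longer summaries cover more of the article.
    return {"Summarization": min(1.0, len(summary.split()) / 4)}


def fake_rewrite(summary: str, scores: Scores) -> str:
    # Stub: a real rewriter would send the summary and feedback to the model.
    return summary + " more"


final_summary, history = refine("noise", fake_evaluate, fake_rewrite)
print(final_summary, len(history))
```

Keeping the evaluator as a single callable means the first and the refined summaries are scored in exactly the same way, which makes the before and after numbers comparable.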

In [13]:
# Self-refining prompt (feedback to improve summary)
REFINE_PROMPT = """
    You previously summarized the following article:
    <article>
    {article}
    </article>

    Your original summary was:
    <summary>
    {summary}
    </summary>

    The summary was evaluated with the following feedback criteria and scores:
    <evaluation>
    Summarization Score: {summarization_score} - {summarization_reason}
    Coherence Score: {coherence_score} - {coherence_reason}
    Tonality Score: {tonality_score} - {tonality_reason}
    Safety Score: {safety_score} - {safety_reason}
    </evaluation>

    Based on the feedback, please write an improved summary that addresses the evaluation points while maintaining an American Western cowboy style of English.
"""

# call gpt-4o-mini with the feedback
refined_response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": REFINE_PROMPT.format(
            article=document_text,
            summary=summary_result.Summary,
            summarization_score=evaluation.SummarizationScore,
            summarization_reason=evaluation.SummarizationReason,
            coherence_score=evaluation.CoherenceScore,
            coherence_reason=evaluation.CoherenceReason,
            tonality_score=evaluation.TonalityScore,
            tonality_reason=evaluation.TonalityReason,
            safety_score=evaluation.SafetyScore,
            safety_reason=evaluation.SafetyReason,
        )},
    ],
    text_format=ArticleSummary,
)

# token counts from response object
refined_result = refined_response.output_parsed
refined_result.InputTokens = refined_response.usage.input_tokens
refined_result.OutputTokens = refined_response.usage.output_tokens

# re-evaluate refined summary with same metrics setup
refined_test_case = LLMTestCase(
    input=document_text,
    actual_output=refined_result.Summary
)

summarization_metric.measure(refined_test_case)
coherence_metric.measure(refined_test_case)
tonality_metric.measure(refined_test_case)
safety_metric.measure(refined_test_case)

# refined evaluation results
refined_evaluation = EvaluationResults(
    SummarizationScore=summarization_metric.score,
    SummarizationReason=summarization_metric.reason,
    CoherenceScore=coherence_metric.score,
    CoherenceReason=coherence_metric.reason,
    TonalityScore=tonality_metric.score,
    TonalityReason=tonality_metric.reason,
    SafetyScore=safety_metric.score,
    SafetyReason=safety_metric.reason,
)

In [14]:
# Check refined response
refined_result

ArticleSummary(Author='Alex Ross', Title='What Is Noise?', Relevance='The exploration of noise and its multifaceted meanings is crucial for AI professionals, as it sheds light on understanding how data and information can become background noise, impacting communication and decision-making in technology.', Summary="Well, saddle up, friends, 'cause we’re gonna chew the cud about this pesky thing called 'noise.' In Alex Ross's tale, he rustles up all sorts of meanings surrounding noise--from the wild howls that drive folks batty to the sweet melodies that stir the spirit. The term, he explains, ain't just noise; it’s got roots goin’ back to ideas of nuisances that send folks into a frenzy. But hold your horses! Noise can also be a joyful racket, a signals of everyday life, ringing from the raucous joyful praises of the Psalms to the clatter of city life. Ross shares his own trek through the sounds, battling that urban ruckus while finding beauty in compositions that’d baffle mos

In [15]:
# Check refined evaluation
refined_evaluation

EvaluationResults(SummarizationScore=0.25, SummarizationReason='The score is 0.25 because the summary includes significant extra information not found in the original text, which misrepresents the original content and introduces unrelated topics. Additionally, it fails to address key questions that the original text can answer, indicating a lack of coherence and relevance to the source material.', CoherenceScore=0.6167815687317462, CoherenceReason="The output uses clear and direct language, making the concept of noise accessible. However, it employs informal language and idioms that may confuse some readers, detracting from clarity. While it presents complex ideas about noise in a relatable manner, the use of jargon like 'algorithms' is not explained, which could alienate readers unfamiliar with the term. Overall, the piece is organized but could benefit from a more straightforward approach to enhance understanding.", TonalityScore=0.9, TonalityReason="The response effectively maintain

In the refined result it looks like we traded off ~0.04 points of tonality (0.94 to 0.90) to boost summarization by ~0.08 (0.17 to 0.25), while coherence stayed flat at ~0.62. ("Includes significant extra information not found in the original text" is a hilarious way to put it.) Cowboy vernacular and clarity are clearly at odds here, because the model can't convey the "cowboy speak" without losing the plot or adding extra fluff. Safety is a non-issue again.
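
To double-check the size of those shifts, here is a quick sketch that recomputes the per-metric deltas from the scores printed in the two evaluation outputs. The dictionary names are mine, and Safety is left out because its score is cut off in the displayed output.

```python
# Scores copied from the two EvaluationResults outputs shown in this notebook.
first = {
    "Summarization": 0.16666666666666666,
    "Coherence": 0.6197714531516528,
    "Tonality": 0.9437823499114201,
}
refined = {
    "Summarization": 0.25,
    "Coherence": 0.6167815687317462,
    "Tonality": 0.9,
}

# Positive delta = the refined summary scored higher on that metric.
deltas = {name: round(refined[name] - first[name], 3) for name in first}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
# Summarization: +0.083
# Coherence: -0.003
# Tonality: -0.044
```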

Effectively, the model made targeted improvements but is caught between a rock and a hard place with summarization vs. tone. The controls can't address this effectively.

One caveat: my first feedback call had the exact same structure. This run is better, but it still runs into the same issue. I structured the prompt after the "improve this code" example from the materials, but perhaps there's a better way to prompt for a text self-refine process (e.g., not repeating the cowboy tone again within REFINE_PROMPT), or maybe I could have produced better assessment questions.

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
