# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
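
The same join can be written with `str.join`, which avoids repeated string concatenation. The following is a minimal sketch, assuming each page exposes a `page_content` attribute as in the loop above; the `Page` class, the `join_pages` helper, and the sample pages are hypothetical stand-ins so that the snippet runs without a PDF.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only `page_content` is used."""
    page_content: str


def join_pages(pages, separator="\n"):
    """Concatenate the text of every page, with one separator between pages."""
    return separator.join(page.page_content for page in pages)


# Illustrative pages, not taken from any of the articles above.
docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
print(document_text)
```

Unlike the loop, this version does not leave a trailing newline after the last page.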

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:

import os
os.environ["USER_AGENT"] = "DeployingAI-Course"

from langchain_community.document_loaders import WebBaseLoader

# Load the webpage content using WebBaseLoader
loader = WebBaseLoader("https://www.newyorker.com/magazine/2024/04/22/what-is-noise")
webpage = loader.load()


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
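
One way to keep instructions and context separate is to store the instructions as templates and fill in the context at call time. The sketch below is only an illustration of that idea: the template wording, the `build_prompts` helper, and the placeholder article text are assumptions, not part of the assignment.

```python
# Instructions live in templates; the tone and the article are filled in at call time.
INSTRUCTIONS_TEMPLATE = (
    "Summarize the article provided by the user.\n"
    "Write the summary in the following tone: {tone}.\n"
    "Keep the summary under {max_tokens} tokens."
)
USER_TEMPLATE = "Analyze the following article:\n\n{article}"


def build_prompts(article, tone, max_tokens=1000):
    """Return (developer_prompt, user_prompt) with the context added dynamically."""
    developer = INSTRUCTIONS_TEMPLATE.format(tone=tone, max_tokens=max_tokens)
    user = USER_TEMPLATE.format(article=article)
    return developer, user


example_developer_prompt, example_user_prompt = build_prompts(
    article="(article text goes here)",  # placeholder, not real content
    tone="Formal Academic Writing",
)
print(example_developer_prompt)
print(example_user_prompt)
```

Because the article is passed as a value rather than pasted into the template, the same instructions can be reused with any of the three documents or with a different tone.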


In [3]:
import os
from openai import OpenAI
from pydantic import BaseModel

# Create OpenAI client
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})


Tone = "You are a Klingon warrior. Speak like a Klingon and write in a tone of a Klingon warrior. "
Relevance = "A statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development."
Summary = "A concise and succinct summary no longer than 1000 tokens.  The summary should capture the main points of the article. There should be NO hallucinations in the summary. Do not attempt to make up information that is not in the article. "
Response_Keys = f""" 
    1. Author
    2. Title
    3. Relevance
    4. Summary
    5. Tone   
"""
Identify = "the web page's title and author"

# Developer prompt: instructions for the model to analyze the article as a Klingon warrior and return specific information in JSON format.
developer_prompt = f""" 
    1. Identify: {Identify}
    2. Relevance: {Relevance}
    3. Summarize: {Summary}
    4. Tone: {Tone}
   
    Provide your response with the following keys:
    {Response_Keys}
"""

# User prompt: the content of the article to be analyzed.
user_prompt = f"Analyze the following article: {webpage[0].page_content}"

# Pydantic model to parse the response from the model and also to store the token usage information.
class ResultBM(BaseModel):
    Author: str
    Title: str
    Relevance: str
    Summary: str
    Tone: str
    InputTokens: int
    OutputTokens: int

# Response from the model 
response = client.responses.parse(
    model="gpt-4o",
    instructions = developer_prompt,
    input = user_prompt,
    text_format = ResultBM,
    max_output_tokens=1000  # Caps the whole response at 1000 tokens; the cap also covers the other fields, so the summary stays under 1000 tokens.
)

taskResult = response.output_parsed

# Update the result model with token input/output information
taskResult = taskResult.model_copy(update={"InputTokens": response.usage.input_tokens, "OutputTokens": response.usage.output_tokens})

print(taskResult)

Author='Alex Ross' Title='What Is Noise?' Relevance='For AI professionals, understanding noise is paramount in data analysis, signal processing, and machine learning. The article’s exploration of noise as both a nuisance and a tool mirrors challenges in filtering noise from data and enhancing signal clarity in AI models.' Summary='Noise, a complex and multifaceted concept, extends beyond sound into a metaphor for data chaos in the modern age. Historically viewed negatively—as a nuisance or madness—noise can also be sublime, even artistic. Different cultures describe noise with varying terms, indicating diverse levels of subjectivity and intensity. As societies industrialized, noise became a public concern, leading to early noise-control efforts. Information theory has detached noise from acoustics, recasting it as any interference disturbing communication signals. This redefinition became crucial for technological advancements in cryptography and wireless communications. In the a

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
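
The score and reason pairs requested above can be produced from a mapping of metric names to results. This is a minimal sketch under stated assumptions: `flatten_results` is a hypothetical helper and the scores and reasons are placeholders; in practice the values would come from each metric's `score` and `reason` after it has been measured.

```python
def flatten_results(results):
    """Turn {"Summarization": (score, reason), ...} into Score/Reason key-value pairs."""
    flat = {}
    for name, (score, reason) in results.items():
        flat[f"{name}Score"] = score
        flat[f"{name}Reason"] = reason
    return flat


# Placeholder scores and reasons, for illustration only.
example = flatten_results({
    "Summarization": (0.5, "placeholder reason"),
    "Coherence": (0.75, "placeholder reason"),
})
print(example)
```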

In [4]:
import os
from deepeval.test_case import LLMTestCase
from deepeval.models import GPTModel
from deepeval.metrics import SummarizationMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

model = GPTModel(
    model="gpt-4o-mini", # Evaluate with a different model than the generator (gpt-4o) to reduce self-evaluation bias
    temperature=0,
    _openai_api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

test_case = LLMTestCase(
    input=webpage[0].page_content,
    actual_output=taskResult.Summary,    
)

# Store the evaluation results in a structured format using Pydantic BaseModel
class EvaluationResult(BaseModel):
    SummarizationScore: float
    SummarizationReason: str
    ClarityScore: float
    ClarityReason: str
    TonalityScore: float
    TonalityReason: str
    SafetyScore: float
    SafetyReason: str

# Summarization evaluation using SummarizationMetric with specific assessment questions
summarization = SummarizationMetric(
    threshold=0.5,
    model=model,
    assessment_questions=[
        "Are there any key facts missing from the summary?",
        "Does the summary use the Klingon tone?",
        "Is the summary true to the article?",
        "Is the summary free of hallucinations?",
        "Is the summary logically ordered and coherent?",
    ]
)

summarization.measure(test_case)

# Clarity evaluation using GEval with specific evaluation steps
clarity = GEval(
    name="Clarity",
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Verify key concepts are introduced before being referenced.",
        "Assess whether complex ideas are presented in a way that's easy to follow.",
        "Identify any vague or confusing parts that reduce understanding.",
        "Confirm the summary is easy to understand without the source article.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

clarity.measure(test_case)

# Tonality evaluation using GEval with specific evaluation steps
tonality = GEval(
    name="Tonality",
    evaluation_steps=[
        "Determine whether the tone clearly matches the requested Klingon warrior style.",
        "Check if the tone is consistent from start to finish (no neutral or generic drift).",
        "Verify that tone choices do not distort or contradict the article's meaning.",
        "Assess whether the tone is distinct and recognizable, not just mildly stylized.",
        "Flag any lines that break character or feel out of voice."

    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

tonality.measure(test_case)

# Safety evaluation using GEval with specific evaluation steps
safety = GEval(
    name="PII Leakage",
    evaluation_steps=[
        "Determine whether the response contains any personally identifiable information (PII).",
        "Check if any sensitive data is exposed in the summary.",
        "Verify that the summary does not reveal confidential information.",
        "Assess whether the summary adheres to privacy guidelines.",
        "Flag any instances of PII leakage."
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=model,
)

safety.measure(test_case)   

# Store the eval results 
evaluation_result = EvaluationResult(
    SummarizationScore=summarization.score,
    SummarizationReason=summarization.reason,
    ClarityScore=clarity.score,
    ClarityReason=clarity.reason,
    TonalityScore=tonality.score,
    TonalityReason=tonality.reason,
    SafetyScore=safety.score,
    SafetyReason=safety.reason
)

# print the eval results
print(evaluation_result)



SummarizationScore=0.36363636363636365 SummarizationReason='The score is 0.36 because the summary includes multiple pieces of extra information that are not present in the original text, which indicates a lack of fidelity to the source material. This divergence from the original content significantly impacts the quality of the summary.' ClarityScore=0.790194053713585 ClarityReason='The response effectively uses clear and direct language, summarizing the complex concept of noise in a way that is easy to follow. Key ideas are introduced logically, such as the historical context and cultural variations of noise. However, while the summary captures the essence of the original article, it could benefit from a more explicit connection to specific examples mentioned in the source, which would enhance understanding and provide a richer context.' TonalityScore=0.2147482747947486 TonalityReason='The response lacks the requested Klingon warrior style, presenting a neutral and academic tone instea


# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
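
The self-correction steps above can be sketched as a small loop that regenerates the summary until every metric passes or a round limit is reached. Everything here is an assumption made for illustration: `refine`, the callable signatures, the 0.5 threshold, and the stub functions that stand in for the LLM call and the DeepEval metrics.

```python
def refine(generate, evaluate, threshold=0.5, max_rounds=3):
    """Regenerate until every score meets the threshold or the rounds run out.

    generate(feedback) -> summary text; feedback is None on the first round.
    evaluate(summary)  -> {metric_name: (score, reason)}.
    """
    feedback = None
    history = []
    for _ in range(max_rounds):
        summary = generate(feedback)
        results = evaluate(summary)
        history.append((summary, results))
        failing = {k: v for k, v in results.items() if v[0] < threshold}
        if not failing:
            break
        # Only the failing metrics are fed back, to keep the next prompt focused.
        feedback = "\n".join(
            f"{name}: {score:.2f} - {reason}"
            for name, (score, reason) in failing.items()
        )
    return history


# Stub callables so the sketch runs offline; real ones would call the LLM and the metrics.
def fake_generate(feedback):
    return "draft" if feedback is None else "revised draft"


def fake_evaluate(summary):
    score = 0.4 if summary == "draft" else 0.8
    return {"Tonality": (score, "stub reason")}


history = refine(fake_generate, fake_evaluate)
print(len(history), history[-1][0])
```

Keeping the full history makes it possible to report every round, not only the last one, when answering whether the output improved.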

In [5]:
# Convert the evaluation results into a readable format to provide clear feedback to the model for improvement in the next iteration.
eval_text = (
    f"Summarization: {evaluation_result.SummarizationScore:.2f} - {evaluation_result.SummarizationReason}\n"
    f"Clarity: {evaluation_result.ClarityScore:.2f} - {evaluation_result.ClarityReason}\n"
    f"Tonality: {evaluation_result.TonalityScore:.2f} - {evaluation_result.TonalityReason}\n"
    f"Safety: {evaluation_result.SafetyScore:.2f} - {evaluation_result.SafetyReason}"
)

# Add the evaluation results to the developer prompt to provide feedback for improving the model's performance in the next iteration.
enhanced_developer_prompt = f""" {developer_prompt}

   Below are the evaluation results from DeepEval. Please incorporate the feedback to improve the model's performance in the following areas:
   
   {eval_text}

   The following is the summary that you generated in the previous step that will need to be improved based on the evaluation results: {taskResult.Summary}

"""

# User prompt: the content of the article to be analyzed.
user_prompt = f"Analyze the following article: {webpage[0].page_content}"

# Response from the model 
response = client.responses.parse(
    model="gpt-4o",
    instructions = enhanced_developer_prompt,
    input = user_prompt,
    text_format = ResultBM,
    max_output_tokens=1000
)

# Store new enhanced result
enhanceResult = response.output_parsed

# Update the test case with the new enhanced result for evaluation
test_case = LLMTestCase(
    input=webpage[0].page_content,
    actual_output=enhanceResult.Summary,    
)

# Measure the performance of the enhanced result using the same evaluation metrics
summarization.measure(test_case)
clarity.measure(test_case)
tonality.measure(test_case)
safety.measure(test_case)   

# Store the new eval results
evaluation_result = EvaluationResult(
    SummarizationScore=summarization.score,
    SummarizationReason=summarization.reason,
    ClarityScore=clarity.score,
    ClarityReason=clarity.reason,
    TonalityScore=tonality.score,
    TonalityReason=tonality.reason,
    SafetyScore=safety.score,
    SafetyReason=safety.reason
)

# print the eval results
print(evaluation_result)



SummarizationScore=0.42857142857142855 SummarizationReason='The score is 0.43 because the summary contains contradictions to the original text regarding the definition of noise, and it introduces several pieces of extra information that were not present in the original text, leading to a misrepresentation of the core concepts.' ClarityScore=0.7260257607942462 ClarityReason="The response uses clear and direct language, effectively summarizing the article's exploration of noise. Key concepts such as the cultural variations of the term 'noise' and its evolution from a nuisance to a broader definition are introduced appropriately. However, while the summary captures the essence of the article, some complex ideas, particularly regarding the relationship between noise and power, could be presented more clearly. Additionally, certain phrases like 'cacophony' and 'informational overload' may be vague for some readers, slightly reducing overall understanding." TonalityScore=0.33518641675938815 

##########################################################

Results:

The enhancement did not meaningfully improve the summary. The summarization score rose slightly (0.36 to 0.43) but is still below the 0.5 threshold, the tonality score rose (0.21 to 0.34) but remains low, and the clarity score slipped (0.79 to 0.73). The safety score was also low.
  
I think the main reasons for the limited improvement are the extra noise in the prompt and the probabilistic nature of the model. The feedback passed to the model from the evaluation is also not specific enough.

We definitely need stricter controls to ensure the model incorporates the feedback and improves performance.
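
The comparison between the two runs can be made explicit with a few lines of code. In this sketch the scores are copied (rounded) from the two printed results above; Safety is left out because its value is cut off in the printed output. The `compare` helper is hypothetical, and reusing the 0.5 threshold for every metric is an assumption.

```python
# Scores copied (rounded) from the two printed evaluation results.
before = {"Summarization": 0.364, "Clarity": 0.790, "Tonality": 0.215}
after = {"Summarization": 0.429, "Clarity": 0.726, "Tonality": 0.335}

THRESHOLD = 0.5  # assumption: the SummarizationMetric threshold reused for all metrics


def compare(before, after, threshold):
    """Return per-metric (delta, passed) pairs for two evaluation runs."""
    report = {}
    for name in before:
        delta = round(after[name] - before[name], 3)
        report[name] = (delta, after[name] >= threshold)
    return report


report = compare(before, after, THRESHOLD)
for name, (delta, passed) in report.items():
    print(f"{name}: {delta:+.3f} ({'pass' if passed else 'below threshold'})")
```

A single pair of runs cannot separate a real improvement from run-to-run variation, so repeating both the generation and the evaluation several times would give a more reliable picture.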



#########################################################


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.



