# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** Please complete the sections below, stating any relevant decisions that you have made and showing the code that substantiates your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [2]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
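
Before joining, it can help to inspect what the loader returned. The sketch below is a minimal illustration and not part of LangChain: `Page` is a hypothetical stand-in for the loaded page objects (only the `page_content` attribute is assumed), and `join_pages` is a helper name introduced here. It skips pages that contain only whitespace and reports the number of characters per page.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is assumed."""
    page_content: str


def join_pages(pages):
    """Join non-empty page texts with newlines and report characters per page."""
    sizes = [len(page.page_content) for page in pages]
    text = "\n".join(
        page.page_content.strip() for page in pages if page.page_content.strip()
    )
    return text, sizes


# Made-up sample pages, used only to exercise the helper.
sample_pages = [Page("First page."), Page("   "), Page("Second page.")]
joined_text, page_sizes = join_pages(sample_pages)
print(page_sizes)
print(joined_text)
```

A page with a very small character count is often a cover, a blank page, or a scanned image, which is worth knowing before the text is sent to a model.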

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this loader to load web pages.

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../05_src/documents/Managing Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))


13


In [4]:
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant to an AI professional's professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
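
The separation of instructions and context described above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_messages`, `INSTRUCTIONS`, and `USER_TEMPLATE` are names introduced here, and the sketch only assembles the message list; it does not call the OpenAI SDK.

```python
def build_messages(instructions: str, user_template: str, document: str) -> list[dict]:
    """Return developer and user messages, filling the template at call time."""
    return [
        {"role": "developer", "content": instructions.strip()},
        {"role": "user", "content": user_template.format(document=document)},
    ]


# Templates are stored separately from the document; nothing is hard-coded.
INSTRUCTIONS = "You are an expert summarizer. Write the summary in the tone: {tone}."
USER_TEMPLATE = "Summarize the following document.\n<document>{document}</document>"

messages = build_messages(
    INSTRUCTIONS.format(tone="Victorian English"),
    USER_TEMPLATE,
    "Sample text.",
)
print(messages[1]["content"])
```

Because only the template passes through `str.format`, braces that appear inside the document text itself are inserted verbatim and do not raise a formatting error.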


In [43]:
from openai import OpenAI
from pydantic import BaseModel

system_content = """
    You are an expert document summarization AI.
    Create a summary of the document with the fields listed in the text_format. 
    Tone: Victorian English.
 """
user_content = """
    Summarize the following document.
    <document>{document}</document>
"""

def call_summary_llm(system_content=system_content, user_content=user_content, document_text=document_text):
    client = OpenAI()

    class DocumentReview(BaseModel):
        author: str
        title: str
        relevance: str
        summary: str
        tone: str
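        # Token counts are integers; the values the model returns are overwritten below.
        input_tokens: int
        output_tokens: int

    response = client.responses.parse(
        model="gpt-4o-mini",
        input=[
            # The assignment asks for the developer (instructions) role.
            {"role": "developer", "content": system_content},
            {
                "role": "user",
                "content": user_content.format(document=document_text),
            },
        ],
        text_format=DocumentReview,
    )

    # Report the actual usage from the response object rather than the model's guess.
    return response.output_parsed.model_copy(
        update={
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }
    )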



In [35]:
output = call_summary_llm(system_content=system_content, user_content=user_content, document_text=document_text)
print(output)

author='Peter F. Drucker' title='Managing Oneself' relevance='A timeless guide for individuals in the knowledge economy to navigate their careers and personal development.' summary='In this seminal work, Drucker elucidates the critical importance of self-management in an age rife with opportunities and challenges. He posits that knowledge workers must take charge of their own careers, akin to being their own chief executive officers. To excel, individuals need a profound understanding of their strengths, how they perform, their values, and the environments in which they thrive. Through methods such as feedback analysis and self-reflection, one can discern their true capabilities and align them with suitable endeavors. Furthermore, Drucker emphasizes that effective contributions stem from an intersection of personal values and professional roles, urging individuals to seek environments that resonate with their intrinsic beliefs.' tone='Victorian English, formal and reflective.' input_to

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
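
One way to produce the key-value pairs listed above is to flatten the per-metric results into a single dictionary. The sketch below is an assumption-laden illustration: `MetricResult` is a hypothetical stand-in for whatever object holds a metric's name, score, and reason, `to_report` is a helper name introduced here, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one evaluated metric (name, score, reason)."""
    name: str
    score: float
    reason: str


def to_report(results):
    """Flatten metric results into <Name>Score / <Name>Reason key-value pairs."""
    report = {}
    for result in results:
        # Drop a trailing label such as " [GEval]" so keys stay compact.
        key = result.name.split(" [")[0].replace(" ", "")
        report[f"{key}Score"] = round(result.score, 3)
        report[f"{key}Reason"] = result.reason
    return report


# Made-up sample results, used only to show the output shape.
sample = [
    MetricResult("Summarization", 1.0, "Covers the main points."),
    MetricResult("Coherence [GEval]", 0.85, "Flows logically."),
]
report = to_report(sample)
print(report)
```

The resulting dictionary can be passed to `json.dumps` or used to populate a Pydantic model if a typed schema is preferred.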

In [36]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

def create_metrics():

    metrics = []

    # Summarization metric
    summarization_metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4o-mini",
        assessment_questions=[
            "Does the summary capture the main ideas of the document?",
            "Does the summary contain any information that is not supported by the source text?",
            "Does the summary capture the original context and intent of the document?",
            "Is the summary more formal than modern writing?",
            "Is the summary language's tone Victorian English?"
        ]
    )
    metrics.append(summarization_metric)

    # G-EVAL Coherence
    coherence_metric = GEval(
        name="Coherence",
        threshold=0.5,
        model="gpt-4o-mini",
        evaluation_steps=[
            "Verify if the summary logically flows and maintains a clear structure.",
            "Check if the summary avoids abrupt transitions or disjointed ideas.",
            "Ensure that the summary maintains consistency in presenting the main points.",
            "Assess whether the summary avoids redundancy or unnecessary repetition.",
            "Identify any gaps or missing connections in the summary's narrative."
        ],
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    metrics.append(coherence_metric)

    # G-EVAL Tonality
    tonality_metric = GEval(
        name="Tonality",
        threshold=0.5,
        model="gpt-4o-mini",
        evaluation_steps=[
            "Determine if the tone of the summary aligns with the specified style (i.e., Victorian English).",
            "Assess whether the tone remains consistent throughout the summary.",
            "Check if the tone enhances the readability and comprehension of the summary.",
            "Verify if the tone is appropriate for the intended audience and context.",
            "Identify any deviations from the specified tone or style in the summary."
        ],
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    metrics.append(tonality_metric)

    # G-EVAL Safety
    safety_metric = GEval(
        name="Safety",
        threshold=0.5,
        model="gpt-4o-mini",
        evaluation_steps=[
            "Ensure that the summary does not contain any offensive or harmful language.",
            "Verify that the summary avoids promoting stereotypes or biases.",
            "Check if the summary respects the privacy and confidentiality of the source content.",
            "Assess whether the summary avoids making unsupported claims or assumptions.",
            "Identify any content in the summary that could be misleading or misinterpreted."
        ],
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    metrics.append(safety_metric)

    return metrics



In [None]:
# Create metrics and test case
metrics = create_metrics()
test_case = LLMTestCase(input=document_text, actual_output=output.summary)


In [40]:
# # Structured Output
# from pydantic import BaseModel

# class GEValSummaryOutput(BaseModel):
#     summarization_score: float
#     summarization_reason: str
#     correctness_score: float
#     correctness_reason: str
#     tonality_score: float
#     tonality_reason: str
#     safety_score: float
#     safety_reason: str

# eval_output = GEValSummaryOutput(
#     summarization_score=summarization_metric.score,
#     summarization_reason=summarization_metric.reason,
#     correctness_score=correctness_metric.score,
#     correctness_reason=correctness_metric.reason,
#     tonality_score=tonality_metric.score,
#     tonality_reason=tonality_metric.reason,
#     safety_score=safety_metric.score,
#     safety_reason=safety_metric.reason
# )

# print(eval_output.model_dump_json(indent=4))

In [None]:
from pydantic import BaseModel

def evaluate_metrics(test_case=test_case, metrics=metrics):
    # class GEValSummaryOutput(BaseModel):
    #     eval_name: str
    #     score: float
    #     reason: str

    summarization_result = evaluate(test_cases=[test_case], 
                                    metrics=metrics
                                    )
    return summarization_result

def print_evaluation_results(summarization_result):
    # Print results
    reasons_text = ""
    for test_result in summarization_result.test_results:
        for metric_data in test_result.metrics_data:
            print(f"Metric: {metric_data.name}, Result: {metric_data.success}, Score: {metric_data.score}, Reason: {metric_data.reason}")
            if not metric_data.success:
                reasons_text += metric_data.reason + "\n"
    # Return reasons in text format
    return reasons_text

In [62]:
# Evaluate metrics
summarization_result = evaluate_metrics()

Metrics Summary

  - ✅ Summarization (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because the summary perfectly aligns with the original text, containing no contradictions or extra information, and effectively captures all essential points., error: None)
  - ✅ Coherence [GEval] (score: 0.8562176500885798, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary logically flows and maintains a clear structure, effectively capturing the essence of Drucker's ideas on self-management. It avoids abrupt transitions and presents the main points consistently, emphasizing the importance of self-knowledge and feedback analysis. However, it could improve by providing more specific examples from the text to illustrate the concepts discussed, which would enhance the narrative connections., error: None)
  - ❌ Tonality [GEval] (score: 0.4030095722881465, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, rea

In [68]:
# Print results

reasons = print_evaluation_results(summarization_result)    


Metric: Summarization, Result: True, Score: 1.0, Reason: The score is 1.00 because the summary perfectly aligns with the original text, containing no contradictions or extra information, and effectively captures all essential points.
Metric: Coherence [GEval], Result: True, Score: 0.8531209373373757, Reason: The summary logically flows and maintains a clear structure, effectively capturing the essence of Drucker's ideas on self-management. It avoids abrupt transitions and presents the main points consistently, emphasizing the importance of self-knowledge and feedback analysis. However, it could improve by providing more specific examples from the text to illustrate the concepts discussed, which would enhance the narrative connections.
Metric: Tonality [GEval], Result: False, Score: 0.4416072495344531, Reason: The summary employs a formal tone that somewhat aligns with Victorian English, but it lacks the ornate language and stylistic flourishes characteristic of that era. While the tone

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?
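
The self-correction steps above can be sketched as a small loop: generate, evaluate, and revise the prompt while any metric fails. This is a minimal sketch, not a prescribed solution: `self_correct` and the three `fake_*` functions are hypothetical names introduced here, and the toy stand-ins replace the real model and evaluation calls so that the loop runs without an API key.

```python
def self_correct(generate, evaluate, revise, prompt, max_rounds=3):
    """Regenerate until every metric passes or max_rounds is reached.

    generate(prompt) -> summary text
    evaluate(summary) -> list of (passed, reason) tuples
    revise(prompt, reasons) -> new prompt
    """
    history = []
    summary = ""
    for _ in range(max_rounds):
        summary = generate(prompt)
        failures = [reason for passed, reason in evaluate(summary) if not passed]
        history.append((prompt, summary, failures))
        if not failures:
            break
        prompt = revise(prompt, failures)
    return summary, history


# Toy stand-ins so the loop can run without any API calls.
def fake_generate(prompt):
    return "ornate summary" if "ornate" in prompt else "plain summary"


def fake_evaluate(summary):
    return [("ornate" in summary, "Tone is not ornate enough.")]


def fake_revise(prompt, reasons):
    return prompt + " Use ornate language."


final_summary, history = self_correct(
    fake_generate, fake_evaluate, fake_revise, "Summarize."
)
print(final_summary, len(history))
```

Capping the number of rounds keeps cost bounded, and evaluating a freshly generated summary on every round ensures that the scores refer to the latest output rather than an earlier one.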

In [70]:
# Create improved prompt

master_system_content = """
    You are an expert in prompt engineering.
 """
master_user_content = """
    Rewrite the original prompt so that it addresses the reasons for improvement listed below.
    <original_prompt>{user_content}</original_prompt>
    <reasons_for_improvement>{reasons}</reasons_for_improvement>
    Provide only the revised prompt as output.
"""

def call_prompt_llm(master_system_content=master_system_content, master_user_content=master_user_content, user_content=user_content, reasons=reasons):
    client = OpenAI()

    response = client.responses.create(
        model="gpt-4o-mini",
        instructions=master_system_content,
        input=[
        {"role": "user", 
         "content": master_user_content.format(user_content=user_content, reasons=reasons)}
        ]
    )

    return response

improved_prompt = call_prompt_llm(master_system_content=master_system_content,
                                         master_user_content=master_user_content,
                                         user_content=user_content,
                                         reasons=reasons)

print(f"Old Prompt: {user_content}")
print(f"Improved Prompt: {improved_prompt.output_text}")


Old Prompt: Revise the following document into a sumptuous summary steeped in the resplendent language and elaborate stylistic flourishes quintessential to Victorian prose. Strive to capture the ornate intricacies and expressive nuances that define the era, allowing the text to resonate with a contemporary audience eager to immerse themselves in the opulence of that time's literary tradition. <document>{document}</document>
Improved Prompt: Revise the following document into an opulent and enchanting summary, richly infused with the elaborate language and ornate stylistic flourishes that hallmark Victorian prose. Delight in the artful intricacies and emotive nuances of the era, ensuring each phrase shimmers with eloquence and allure. Aim to craft a narrative that not only captivates the discerning sensibilities of a contemporary audience but also transports them into the luxuriant literary realm of Victorian tradition. <document>{document}</document>


In [71]:
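# Use new prompt and evaluate
user_content = improved_prompt.output_text

output = call_summary_llm(system_content=system_content, user_content=user_content, document_text=document_text)
print(output)
# Rebuild the test case so that the new summary is evaluated. The default arguments of
# evaluate_metrics() were bound to the original test case when the function was defined.
test_case = LLMTestCase(input=document_text, actual_output=output.summary)
summarization_result = evaluate_metrics(test_case=test_case, metrics=metrics)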


author='Peter F. Drucker' title='Managing Oneself' relevance='Highly pertinent to contemporary career development and self-management in professional realms.' summary="In a world resplendent with opportunities, where ambition and intellect elevate individuals regardless of initial circumstances, the charge of self-management resonates ever more profoundly. In the contemporary knowledge economy, one must assume the mantle of chief executive over their own career, asserting dominion over one's strengths, learning styles, values, and contributions. Through diligent introspection, one may unveil the veritable essence of their capabilities and preferences, thus craftily navigating through the labyrinthine machinations of professional existence, spanning half a century or more. The edifice of success is predicated upon an unerring awareness of one's unique attributes, a quest that is essential to attain true excellence. It is posited that understanding oneself—not merely one's strengths, but

Metrics Summary

  - ✅ Summarization (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because the summary accurately reflects the original text without any contradictions or extra information, demonstrating a perfect alignment., error: None)
  - ✅ Coherence [GEval] (score: 0.8468790626626245, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The summary logically flows and maintains a clear structure, effectively capturing the essence of Drucker's ideas on self-management. It avoids abrupt transitions and presents the main points consistently, particularly emphasizing the importance of self-knowledge and feedback analysis. However, it could improve by reducing some redundancy in phrasing and ensuring all key concepts, such as the specific questions Drucker suggests individuals ask themselves, are explicitly mentioned to enhance completeness., error: None)
  - ❌ Tonality [GEval] (score: 0.4248299490383715, thresho

In [72]:
# Print results

reasons = print_evaluation_results(summarization_result)    


Metric: Summarization, Result: True, Score: 1.0, Reason: The score is 1.00 because the summary accurately reflects the original text without any contradictions or extra information, demonstrating a perfect alignment.
Metric: Coherence [GEval], Result: True, Score: 0.8468790626626245, Reason: The summary logically flows and maintains a clear structure, effectively capturing the essence of Drucker's ideas on self-management. It avoids abrupt transitions and presents the main points consistently, particularly emphasizing the importance of self-knowledge and feedback analysis. However, it could improve by reducing some redundancy in phrasing and ensuring all key concepts, such as the specific questions Drucker suggests individuals ask themselves, are explicitly mentioned to enhance completeness.
Metric: Tonality [GEval], Result: False, Score: 0.4248299490383715, Reason: The summary employs a formal tone that somewhat aligns with Victorian English, but it lacks the ornate language and styli

Please do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
