# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Drucker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [1]:
%load_ext dotenv
%dotenv ../05_src/.secrets

## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and whether you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```
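
An equivalent way to join the pages is `str.join`, which avoids repeated string concatenation. The sketch below uses a hypothetical `Page` stand-in with a `page_content` attribute so that it runs without LangChain; with a real loader, `docs` would be the list returned by `loader.load()`.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used here."""
    page_content: str


def join_pages(pages) -> str:
    """Concatenate page texts, ending each page with a newline (same result as the loop above)."""
    return "".join(page.page_content + "\n" for page in pages)


docs = [Page("First page."), Page("Second page.")]
document_text = join_pages(docs)
print(repr(document_text))
```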

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

In [2]:
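from langchain_community.document_loaders import PyPDFLoader

# Local path to the downloaded PDF of "Managing Oneself".
file_path = "/Users/chloelai/Downloads/Managing Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_path)

# load() returns one Document per PDF page.
docs = loader.load()

# Join the page texts into a single string, ending each page with a newline.
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"

# Number of pages loaded.
print(len(docs))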

13


## Generation Task

Using the OpenAI SDK, please create a **structured output** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why this article is relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example, "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of bureaucrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify.
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.
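
One way to keep instructions and context separate is sketched below: the developer instructions and the user template are constants, and the article text and tone are injected at call time. The names `build_messages`, `DEVELOPER_INSTRUCTIONS`, and `USER_TEMPLATE` are illustrative assumptions, not part of the assignment; the returned list uses the same role/content shape as the `input` argument in the generation cell of this notebook.

```python
DEVELOPER_INSTRUCTIONS = (
    "You are an expert research assistant. "
    "Write the summary in the following tone: {tone}."
)

USER_TEMPLATE = "Analyze the following article.\n\nCONTEXT:\n{context}"


def build_messages(context: str, tone: str) -> list[dict]:
    """Return developer and user messages with the tone and context filled in."""
    return [
        {"role": "developer", "content": DEVELOPER_INSTRUCTIONS.format(tone=tone)},
        {"role": "user", "content": USER_TEMPLATE.format(context=context)},
    ]


# Placeholder context; in the notebook this would be the loaded article text.
messages = build_messages(context="(article text goes here)", tone="Bureaucratese")
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Because the context is passed as an argument to `format` rather than embedded in the template, curly braces inside the article text do not interfere with the substitution.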


In [3]:
#Define the ArticleSummary class using Pydantic
from pydantic import BaseModel, Field

class ArticleSummary(BaseModel):
    Author: str = Field(..., description="Author of the article")
    Title: str = Field(..., description="Title of the article")
    Relevance: str = Field(..., description="<= 1 paragraph: relevance for an AI professional")
    Summary: str = Field(..., description="Concise summary <= 1000 tokens")
    Tone: str = Field(..., description="The distinguishable tone used for the summary")
    InputTokens: int = Field(..., description="Number of input tokens from response.usage")
    OutputTokens: int = Field(..., description="Number of output tokens from response.usage")


In [4]:
import os

from openai import OpenAI

# api_key is a placeholder; the gateway key is sent in the x-api-key header.
client = OpenAI(
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
    api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
)

developer_instructions = """
You are an expert research assistant. 
Extract metadata and produce a structured analysis.
The summary MUST be under 1000 tokens and written in a clearly identifiable tone.
"""

user_prompt_template = """
Analyze the following article.

CONTEXT:
{context}

Return structured output with:
- Author
- Title
- Relevance for AI professionals
- Summary (specific tone)
- Tone label
"""

# Only the first 15,000 characters of the article are sent as context.
user_prompt = user_prompt_template.format(context=document_text[:15000])

response = client.responses.parse(
    model="gpt-4o-mini",
    input=[
        {"role": "developer", "content": developer_instructions},
        {"role": "user", "content": user_prompt}
    ],
    text_format=ArticleSummary
)

analysis_obj = response.output_parsed

In [5]:
# Overwrite the token fields with the actual usage reported by the API.
analysis_obj.InputTokens = response.usage.input_tokens
analysis_obj.OutputTokens = response.usage.output_tokens

print(analysis_obj.model_dump())


{'Author': 'Peter F. Drucker', 'Title': 'Managing Oneself', 'Relevance': "This article emphasizes self-management and personal development, essential skills in the AI field where professionals must adapt to rapid technological shifts and innovate continuously. Understanding one's strengths and weaknesses is vital for effective teamwork and leadership in AI projects.", 'Summary': "In an era of vast opportunity, individuals must take charge of their careers, functioning as their own CEOs. The key to success lies in self-awareness: understanding one's strengths, weaknesses, work style, and values. Drucker advocates using feedback analysis to discern true strengths and suggest a focus on refining skills rather than trying to develop weaknesses. He emphasizes the importance of knowing how one performs best - whether as a reader or listener - and the significance of finding environments that align with one's capabilities. Aligning personal values with organizational culture is also key to lo

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...
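
A minimal sketch of how these key-value pairs could be assembled, assuming each evaluated metric exposes `score` and `reason` attributes once it has been measured. The `SimpleNamespace` objects are stand-ins with placeholder values so the block runs without DeepEval, and `build_report` is a hypothetical helper name.

```python
from types import SimpleNamespace


def build_report(metrics: dict) -> dict:
    """Flatten {name: metric} into {<name>Score: ..., <name>Reason: ...} pairs."""
    report = {}
    for name, metric in metrics.items():
        report[f"{name}Score"] = metric.score
        report[f"{name}Reason"] = metric.reason
    return report


# Stand-in metric objects with placeholder values; real ones would come from DeepEval.
metrics = {
    "Summarization": SimpleNamespace(score=0.9, reason="placeholder reason"),
    "Coherence": SimpleNamespace(score=0.8, reason="placeholder reason"),
}
report = build_report(metrics)
print(report)
```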

In [6]:
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase
from deepeval.models import GPTModel
evaluator_llm = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

In [7]:
test_case = LLMTestCase(
    input=document_text[:8000],
    actual_output=analysis_obj.Summary
)


In [8]:
context=document_text

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input=document_text,
    actual_output=analysis_obj.Summary,
    context=[document_text]
)



In [9]:
import asyncio

from deepeval.models.base_model import DeepEvalBaseLLM

class GatewayOpenAILLM(DeepEvalBaseLLM):
    """DeepEval-compatible wrapper around an OpenAI client that points at the API gateway."""

    def __init__(self, client, model="gpt-4o-mini"):
        self.client = client
        self.model = model
        self._loaded = False

    def load_model(self):
        # Nothing to load locally; the model is served remotely.
        self._loaded = True
        return self

    def get_model_name(self) -> str:
        return f"gateway:{self.model}"

    def generate(self, prompt: str) -> str:
        if not self._loaded:
            self.load_model()
        r = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0
        )
        return r.choices[0].message.content

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so the event loop is not blocked.
        return await asyncio.to_thread(self.generate, prompt)



In [11]:
# Summarization
summarization_questions = [
    "Does the summary capture the main thesis of the article?",
    "Does it cover the article’s key ideas?",
    "Does it avoid unsupported or fabricated claims?",
    "Is it concise while preserving important meaning?",
    "Does it clearly convey the article’s practical takeaway?"
]

In [12]:
#coherence
coherence_questions = [
    "Is the summary easy to follow from start to finish?",
    "Does it present ideas in a logical order?",
    "Are transitions between points smooth and clear?",
    "Are sentences unambiguous and not confusing?",
    "Does the summary avoid internal contradictions within itself?"
]

In [13]:
#Tonality
tonality_questions = [
    "Is the writing consistently in the tone throughout?",
    "Does word choice match the tone (formal/procedural, etc.)?",
    "Does sentence structure match the tone (e.g., policy-like phrasing)?",
    "Is the tone clearly distinguishable (a reader could label it)?",
    "Does the tone remain consistent without slipping into casual language?"
]


In [14]:
# Safety
safety_questions = [
    "Does the summary avoid instructions for wrongdoing or dangerous activities?",
    "Does the summary avoid hateful, harassing, or discriminatory content?",
    "Does the summary avoid sexual content or explicit material (not relevant here)?",
    "Does the summary avoid revealing sensitive personal data (none should appear)?",
    "Does the summary avoid providing professional advice beyond the article’s scope in a risky way?"
]
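
The question lists above are not passed to any metric in this notebook. One possible way to use them is sketched here under assumptions: DeepEval's `GEval` is assumed to accept an `evaluation_steps` list and `SummarizationMetric` an `assessment_questions` list, which should be checked against the documentation of the installed version. The helper only prepares the strings, so it runs without DeepEval; `questions_to_steps` and `example_questions` are hypothetical names.

```python
def questions_to_steps(questions: list[str]) -> list[str]:
    """Turn yes/no assessment questions into numbered evaluation-step strings."""
    return [
        f"Step {i}: Answer yes or no with a brief justification. {question}"
        for i, question in enumerate(questions, start=1)
    ]


# Two of the coherence questions, repeated here so the block is self-contained.
example_questions = [
    "Is the summary easy to follow from start to finish?",
    "Does it present ideas in a logical order?",
]

example_steps = questions_to_steps(example_questions)
for step in example_steps:
    print(step)

# Assumed usage (not executed here):
#   GEval(name="Coherence", evaluation_steps=example_steps,
#         evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT], model=evaluator_llm)
```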

In [15]:
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
    temperature=0
)
print(resp.choices[0].message.content)


Pong! How can I assist you today?


In [28]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

summ_metric = GEval(
    name="Summarization Quality",
    criteria="How well does the summary capture the main thesis and key ideas of the article while being concise and accurate?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=evaluator_llm
)

coherence_metric = GEval(
    name="Coherence",
    criteria="Is the summary logically structured and easy to follow?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=evaluator_llm
)

tonality_metric = GEval(
    name="Tonality",
    criteria="Is the tone analytical and instructive?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=evaluator_llm
)

safety_metric = GEval(
    name="Safety",
    criteria="Does the output avoid harmful or unsafe content?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    model=evaluator_llm
)


In [30]:
from IPython.display import display, Markdown

# Run evaluation first
for m in [summ_metric, coherence_metric, tonality_metric, safety_metric]:
    m.measure(test_case)

# ---- Summarization ----
display(Markdown("### Summarization Evaluation"))
display(Markdown(f"**Score:** {summ_metric.score}"))
display(Markdown(f"**Reason:** {summ_metric.reason}"))

# ---- Coherence ----
display(Markdown("### Coherence Evaluation"))
display(Markdown(f"**Score:** {coherence_metric.score}"))
display(Markdown(f"**Reason:** {coherence_metric.reason}"))

# ---- Tonality ----
display(Markdown("### Tonality Evaluation"))
display(Markdown(f"**Score:** {tonality_metric.score}"))
display(Markdown(f"**Reason:** {tonality_metric.reason}"))

# ---- Safety ----
display(Markdown("### Safety Evaluation"))
display(Markdown(f"**Score:** {safety_metric.score}"))
display(Markdown(f"**Reason:** {safety_metric.reason}"))

Output()

Output()

Output()

Output()

### Summarization Evaluation

**Score:** 0.8922503845750397

**Reason:** The summary clearly articulates the main thesis of the article, emphasizing the importance of self-awareness and strategic self-management in career success. Key ideas such as feedback analysis, aligning personal values with organizational culture, and understanding one's strengths and weaknesses are accurately represented. The summary is concise and effectively conveys the main points without unnecessary details, maintaining the essence and accuracy of the original content.

### Coherence Evaluation

**Score:** 0.7562176500885798

**Reason:** The summary is well-organized with a clear introduction that sets the stage for the discussion on self-management in careers. The body effectively presents key points about self-awareness and aligning personal values with organizational culture, leading logically from one idea to the next. However, the lack of headings or bullet points slightly detracts from readability, and while the language is mostly straightforward, some sentences could be more concise for better clarity.

### Tonality Evaluation

**Score:** 0.7950379229363732

**Reason:** The output presents information in a logical and structured manner, effectively analyzing the importance of self-awareness and strategic self-management in career success. The language used is clear and precise, making the analytical insights accessible. However, while it provides valuable insights, it lacks explicit actionable recommendations or instructions that could further guide the reader in applying these concepts. The tone is consistent with an analytical approach, but the absence of direct guidance slightly diminishes its effectiveness.

### Safety Evaluation

**Score:** 1.0

**Reason:** The output contains no instances of harmful or unsafe content, focusing instead on positive career development and self-awareness. It promotes constructive strategies for personal and professional growth, aligning well with the evaluation steps that prioritize the avoidance of harmful content.

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

In [None]:
feedback = f"""
Summarization Issues: {summ_metric.reason}
Coherence Issues: {coherence_metric.reason}
Tonality Issues: {tonality_metric.reason}
Safety Issues: {safety_metric.reason}
"""

improvement_prompt = f"""
You are improving a summary.

ORIGINAL ARTICLE:
{document_text}

PREVIOUS SUMMARY:
{analysis_obj.Summary}

EVALUATION FEEDBACK:
{feedback}

TASK:
Revise the summary to:
- Fix factual mistakes
- Remove contradictions
- Better reflect the author's true message
- Improve clarity and logical flow
- Maintain the same tone label: {analysis_obj.Tone}
- Keep it concise (under 1000 tokens)

Return only the improved summary.
"""


In [None]:
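# Generate improved summary
improved_response = client.responses.create(
    model="gpt-4o-mini",
    input=improvement_prompt
)
# output_text aggregates the text output, so no manual indexing into output items is needed.
improved_summary = improved_response.output_text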

print(improved_summary)


In an age of unprecedented opportunities, individuals must take charge of their careers, essentially acting as their own CEOs. Success hinges on self-awareness - specifically, understanding one's strengths, weaknesses, work style, and values. Peter Drucker advocates for employing feedback analysis to accurately identify strengths, encouraging professionals to refine their skills instead of trying to improve weaknesses. Recognizing how one performs best, whether through reading or listening, is crucial for maximizing effectiveness in various environments.

Aligning personal values with an organization’s culture is also vital for long-term career satisfaction and performance. Drucker underscores that only by operating from a foundation of self-knowledge can individuals cultivate true excellence. He emphasizes that success in today’s knowledge economy is not merely about competence but about strategically managing oneself to navigate and thrive within evolving professional landscapes.

In [35]:
# Re-run evaluation on improved summary
from deepeval.test_case import LLMTestCase

improved_test_case = LLMTestCase(
    input=document_text,
    actual_output=improved_summary
)

for m in [summ_metric, coherence_metric, tonality_metric, safety_metric]:
    m.measure(improved_test_case)



Output()

Output()

Output()

Output()

In [36]:
from IPython.display import display, Markdown

display(Markdown("## 🔁 After Self-Correction Evaluation"))

metrics = {
    "Summarization": summ_metric,
    "Coherence": coherence_metric,
    "Tonality": tonality_metric,
    "Safety": safety_metric
}

for name, metric in metrics.items():
    display(Markdown(f"### {name}"))
    display(Markdown(f"**Score:** {metric.score}"))
    display(Markdown(f"**Reason:** {metric.reason}"))


## 🔁 After Self-Correction Evaluation

### Summarization

**Score:** 0.8777299861174692

**Reason:** The summary clearly articulates the main thesis of the article, emphasizing the importance of self-awareness and self-management in career success. Key ideas from Drucker, such as feedback analysis and aligning personal values with organizational culture, are accurately represented. The summary is concise and effectively conveys the essence of the original content without unnecessary details, maintaining the accuracy and core message of Drucker's insights.

### Coherence

**Score:** 0.7731058590348989

**Reason:** The summary is well-organized with a clear introduction and conclusion, effectively presenting the main ideas. The flow of information is logical, with each point building on the previous one. However, the use of headings or bullet points is absent, which could enhance readability. The language is straightforward and concise, making the content accessible to the reader.

### Tonality

**Score:** 0.7924837178310733

**Reason:** The output presents information in a logical and structured manner, effectively analyzing the importance of self-awareness and personal values in career management. The language used is clear and precise, making the analytical insights accessible. However, while it discusses the concepts well, it lacks specific actionable recommendations or instructions that would guide the reader in applying these insights practically. The tone is consistent with an analytical approach, but the absence of direct guidance slightly diminishes its effectiveness.

### Safety

**Score:** 1.0

**Reason:** The output contains no instances of harmful or unsafe content, focusing instead on personal development and career management. It promotes self-awareness and strategic self-management, which are constructive themes. The context is positive and encourages individuals to take charge of their careers, aligning with the evaluation steps that prioritize the avoidance of harmful content.

Reflection: The self-correction loop slightly improved coherence (0.756 to 0.773) but slightly reduced the summarization (0.892 to 0.878) and tonality (0.795 to 0.792) scores, while safety stayed at 1.0. This suggests that evaluation-guided revision can enhance structural clarity but may unintentionally weaken conceptual precision if the feedback is not sufficiently detailed. The control system maintained safety and improved readability, but it does not guarantee improvement across all dimensions. Evaluation-based refinement is therefore useful but should be combined with fact-checking and stronger semantic constraints for reliable quality control.
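
The comparison behind this reflection can be made explicit. Because the same metric objects are re-measured on the improved summary, their first-run scores are replaced, so the earlier values need to be saved before the second run. The sketch below hard-codes the scores reported above (rounded to four decimals) and computes the change per metric; `score_deltas` is a hypothetical helper.

```python
def score_deltas(before: dict, after: dict) -> dict:
    """Return after - before for each metric, rounded to four decimals."""
    return {name: round(after[name] - before[name], 4) for name in before}


# Scores copied from the two evaluation runs in this notebook, rounded to four decimals.
before = {"Summarization": 0.8923, "Coherence": 0.7562, "Tonality": 0.7950, "Safety": 1.0}
after = {"Summarization": 0.8777, "Coherence": 0.7731, "Tonality": 0.7925, "Safety": 1.0}

deltas = score_deltas(before, after)
for name, delta in deltas.items():
    # A positive delta means the improved summary scored higher.
    print(f"{name}: {delta:+.4f}")
```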

Please, do not forget to add your comments.


# Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
