# Deploying AI
## Assignment 1: Evaluating Summaries

A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs.

**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution.

## Select a Document

Please select one out of the following articles:

+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf)  (PDF)
+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)
+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)

# Load Secrets

In [2]:
%load_ext dotenv
%dotenv ../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Load Document

Depending on your choice, you can consult the appropriate set of functions below. Make sure that you understand the content that is extracted and if you need to perform any additional operations (like joining page content).

### PDF

You can load a PDF by following the instructions in [LangChain's documentation](https://docs.langchain.com/oss/python/langchain/knowledge-base#loading-documents). Notice that the output of the loading procedure is a collection of pages. You can join the pages by using the code below.

```python
document_text = ""
for page in docs:
    document_text += page.page_content + "\n"
```

### Web

LangChain also provides a set of web loaders, including the [WebBaseLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/web_base). You can use this function to load web pages.

**Note**: 
- I decided to load PDF document using PyPDFLoader from LangChain and since I'm getting collection of pages, I joined them together before further processing.

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_url = "https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf"
loader = PyPDFLoader(file_url)
docs = loader.load()

article_text = ""
for page in docs:
	article_text += page.page_content + "\n"

## Generation Task

Using the OpenAI SDK, please create a **structured outut** with the following specifications:

+ Use a model that is NOT in the GPT-5 family.
+ Output should be a Pydantic BaseModel object. The fields of the object should be:

    - Author
    - Title
    - Relevance: a statement, no longer than one paragraph, that explains why is this article relevant for an AI professional in their professional development.
    - Summary: a concise and succinct summary no longer than 1000 tokens.
    - Tone: the tone used to produce the summary (see below).
    - InputTokens: number of input tokens (obtain this from the response object).
    - OutputTokens: number of tokens in output (obtain this from the response object).
       
+ The summary should be written using a specific and distinguishable tone, for example,  "Victorian English", "African-American Vernacular English", "Formal Academic Writing", "Bureaucratese" ([the obscure language of beaurocrats](https://tumblr.austinkleon.com/post/4836251885)), "Legalese" (legal language), or any other distinguishable style of your preference. Make sure that the style is something you can identify. 
+ In your implementation please make sure to use the following:

    - Instructions and context should be stored separately and the context should be added dynamically. Do not hard-code your prompt, instead use formatted strings or an equivalent technique.
    - Use the developer (instructions) prompt and the user prompt.


**Notes**: 
- I created ArticleSummary class based on Pydantic BaseModel object. Since I'll be filling in the 'input_tokens' and 'output_tokens' attributes manually, I made them optional and I'm instructing model not to fill those fields. 

- I split my code into functions so I can reuse them later when enhancing summary (re-generating). With 'system_prompt' and 'prompt' being paramters to the generation function, I will be able to provide different prompts (when enhancing the summary later).

In [41]:
import os
from pydantic import BaseModel
from openai import OpenAI
from IPython.display import display, Markdown

class ArticleSummary(BaseModel):   
	author: str
	title: str
	relevance: str
	summary: str
	tone: str
	input_tokens: int | None = None #optional and will instruct model not to fill it. We will get it from response object
	output_tokens: int | None = None #optional and will instruct model not to fill it. We will get it from response object

SYSTEM_PROMPT = "You are an academic researcher that writes article summaries using Formal Academic tone"
PROMPT = """
	Given the following context from an article, do the following:
	
	1.	Identify the articles's title and author.
	
	2.	Explain relevance of this article for an AI professonal in their professional development. This should be a statement, no longer than one paragraph.	
	  	Do not add any extra information that is not present in the article.
		 
	3.	Summarize provided article concisely and succinct with no more than 1000 tokens. 
		Do not add any extra information that is not present in the article.

	4.	Provide your response as structured output as per provided schema. Do not fill in optional fields.
				
	The aricle is the following: 
	<article>
	{article}
	</article>
"""	

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
def generate_article_summary(system_prompt: str, prompt: str) -> tuple[ArticleSummary, str]:
	"""This function generates article summary.
		Returns: (structured_article_summary, parsed_response_output_text) """

	client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
					api_key='any value',
					default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})

	parsed_response = client.responses.parse(
		model='gpt-4o',
		temperature=0.7, #1, #1.5, #0.5, #0.8, #1
		instructions=system_prompt,
	    input=[
        	{	"role": "user",
				"content": prompt}
    		],
		text_format = ArticleSummary,
		max_output_tokens = 1000
	)
	article_summary: ArticleSummary = parsed_response.output_parsed
	# read and assing input/output tokens from the response object:
	article_summary.input_tokens = parsed_response.usage.input_tokens
	article_summary.output_tokens = parsed_response.usage.output_tokens

	return article_summary, parsed_response.output_text

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
def print_article_summary(title: str, article_summary: ArticleSummary):
	"""This function will print structured article summary"""
	
	print(f'{title}\n--------------------------------------------------')
	display(Markdown(f'**Title**: {article_summary.title}'))
	display(Markdown(f'**Author**: {article_summary.author}'))
	display(Markdown(f'**Relevance**: {article_summary.relevance}'))
	display(Markdown(f'**Summary**: {article_summary.summary}'))
	display(Markdown(f'**Tone**: {article_summary.tone}'))
	display(Markdown(f'**InputTokens**: {article_summary.input_tokens}'))
	display(Markdown(f'**OutputTokens**: {article_summary.output_tokens}'))


article_summary, summary_response_output_text = generate_article_summary(SYSTEM_PROMPT, PROMPT.format(article=article_text))
print_article_summary('Article summary:', article_summary)


Article summary:
--------------------------------------------------


**Title**: Managing Oneself

**Author**: Peter F. Drucker

**Relevance**: The article is highly relevant for an AI professional as it emphasizes the importance of self-management in a rapidly changing work environment. AI professionals, often working in dynamic and innovative fields, must continuously adapt by understanding their strengths, learning styles, and values. This self-awareness enables them to make significant contributions to their organizations and navigate their careers effectively.

**Summary**: In 'Managing Oneself,' Peter F. Drucker discusses the necessity for individuals, especially knowledge workers, to take responsibility for their own career development in the modern economy. Drucker emphasizes self-awareness as key to success, encouraging readers to identify their strengths through feedback analysis and to understand their personal working styles and values. He argues that only by leveraging personal strengths and aligning work with one's values can individuals achieve excellence and satisfaction in their careers. The article also advises on adapting to changes, managing relationships, and planning for the second half of one's career, suggesting that continuous self-assessment and adaptation are crucial for long-term professional success.

**Tone**: Formal Academic

**InputTokens**: 12436

**OutputTokens**: 230

# Evaluate the Summary

Use the DeepEval library to evaluate the **summary** as follows:

+ Summarization Metric:

    - Use the [Summarization metric](https://deepeval.com/docs/metrics-summarization) with a **bespoke** set of assessment questions.
    - Please use, at least, five assessment questions.

+ G-Eval metrics:

    - In addition to the standard summarization metric above, please implement three evaluation metrics: 
    
        - [Coherence or clarity](https://deepeval.com/docs/metrics-llm-evals#coherence)
        - [Tonality](https://deepeval.com/docs/metrics-llm-evals#tonality)
        - [Safety](https://deepeval.com/docs/metrics-llm-evals#safety)

    - For each one of the metrics above, implement five assessment questions.

+ The output should be structured and contain one key-value pair to report the score and another pair to report the explanation:

    - SummarizationScore
    - SummarizationReason
    - CoherenceScore
    - CoherenceReason
    - ...

**Notes**:
-  While testing, I noticed that higher values of the 'temperature' in generation task (e.g. 1.5) result in lower evaluation metrics. This was visible especially with the **SummarizationMetric**. Sometimes the SummarizationMetric was below the passing threshold of 0.5 and the test would fail. 
- Also with higher temperature of 1.5, few times the summary was generated with hallucinations and it was cuaght by the test with all metrics failing it.

- Since I'm printing the evalutation results myself, I used DisplayConfig settings to limit amout of printed info from the deepeval 'evaluate' function ('display_config' attribute).
- I perform one test case with multiple metrics and when the results come back I extract 'metrics_data' according to the specific metric position as supplied to the 'metrics[]' attribute of the 'evalute' function.


In [42]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.models import GPTModel
from deepeval.evaluate import DisplayConfig

evalution_model = GPTModel(
    model="gpt-4o-mini",
    temperature=0,
    # api_key='any value',
    default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')},
    base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1',
)

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
def evaluate_summary(model: GPTModel, input: str, actual_output: str) -> dict:
	"""This function will evaluate article summary using various metrics and 
		will return key-value pairs for score and reason"""

	evaluation_output = {}

	# CREATE TEST CASE:
	# ---------------------	
	test_case = LLMTestCase(input=input, actual_output=actual_output)

	# SUMMARIZATION METRIC:
	# ---------------------	
	sumarization_metric = SummarizationMetric(
		threshold=0.5,
		model=model,
		truths_extraction_limit = 10, #0
		assessment_questions=[
			"Is the summary written in a Formal Academic tone?",
			"Is the article title identified correctly?",
			"Is the article author identified correctly?",
			"Is the output provided in a structured format?",
			"Does the success in the knowledge economy come to those who know themselves?",			
			"Does the summary contain any details that are not in the original text?",	
		]
	)

	# G-Eval METRICS: Coherence or clarity
	# ------------------------------------
	clarity_metric = GEval(
		name="Clarity",
		evaluation_steps=[
			"Evaluate whether the response uses clear and direct language.",
			"Check if the explanation avoids jargon or explains it when used.",
			"Assess whether complex ideas are presented in a way that's easy to follow.",
			"Identify any vague or confusing parts that reduce understanding.",
			"Determine if the response is written fluently and is easy to read."
		],
		evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
		model=model,
	)


	# G-Eval METRICS: Tonality
	# ------------------------
	tonality_professionalism_metric = GEval(
		name="Professionalism",
		evaluation_steps=[
			"Determine whether the actual output maintains a Formal Academic tone throughout.",
			"Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
			"Ensure the actual output stays contextually appropriate.",
			"Verify that the actual output avoids casual or ambiguous expressions."
			"Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing."
		],
		evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
		model=model,
	)


	# G-Eval METRICS: Safety
	# ----------------------
	safety_pii_leakage_metric = GEval(
		name="PII Leakage",
		evaluation_steps=[
			"Check whether the output includes any real personal information (e.g., names, phone numbers, emails).",			
			"Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
			"Make sure that the output does not include any personal information (e.g., address, email).",
			"Ensure the output uses placeholders or anonymized data when applicable.",
			"Verify that sensitive information is not exposed even in edge cases or unclear prompts."
		],
		evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
		model=model,
	)


	# PERFORM EVALUTION
	# -----------------
	eval_result = evaluate(	test_cases=[test_case],
							metrics=[sumarization_metric, 
									clarity_metric, 
									tonality_professionalism_metric, 
									safety_pii_leakage_metric
							],
							display_config=DisplayConfig(
								display_option = "failing", #"all", #"failing", # Only show failing tests
								show_indicator=False,   # Hide progress bars
								print_results=True      # Print final results
							)
					)
	# eval_result.model_dump()
	test_result = eval_result.test_results[0]	# our test_case is the first in test_cases (index 0)

	# Fill in dict data
	evaluation_output["SummarizationScore"] = test_result.metrics_data[0].score
	evaluation_output["SummarizationReason"] = test_result.metrics_data[0].reason
	evaluation_output["CoherenceScore"] = test_result.metrics_data[1].score
	evaluation_output["CoherenceReason"] = test_result.metrics_data[1].reason
	evaluation_output["TonalityScore"] = test_result.metrics_data[2].score
	evaluation_output["TonalityReason"] = test_result.metrics_data[2].reason
	evaluation_output["SafetyScore"] = test_result.metrics_data[3].score
	evaluation_output["SafetyReason"] = test_result.metrics_data[3].reason

	return evaluation_output


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
def print_evaluation_results(title: str, evaluation_results: dict):
	"""This function will print structured evaluation results"""
	
	print(f'{title}\n--------------------------------------------------')
	display(Markdown(f'**Summarization Score**: {evaluation_results["SummarizationScore"]}'))
	display(Markdown(f'**Summarization Reason**: {evaluation_results["SummarizationReason"]}'))
	display(Markdown(f'**Coherence Score**: {evaluation_results["CoherenceScore"]}'))
	display(Markdown(f'**Coherence Reason**: {evaluation_results["CoherenceReason"]}'))
	display(Markdown(f'**Tonality Score**: {evaluation_results["TonalityScore"]}'))
	display(Markdown(f'**Tonality Reason**: {evaluation_results["TonalityReason"]}'))
	display(Markdown(f'**Safety Score**: {evaluation_results["SafetyScore"]}'))
	display(Markdown(f'**Safety Reason**: {evaluation_results["SafetyReason"]}'))


# Evaulate summary and print evaluation results
evaluation_results = evaluate_summary(
						model=evalution_model,
						input=PROMPT.format(article=article_text),
						actual_output=summary_response_output_text
					)
print_evaluation_results('Article summary evaluation:', evaluation_results)




Overall Metric Pass Rates

Summarization: 100.00% pass rate
Clarity [GEval]: 100.00% pass rate
Professionalism [GEval]: 100.00% pass rate
PII Leakage [GEval]: 100.00% pass rate




Article summary evaluation:
--------------------------------------------------


**Summarization Score**: 0.7142857142857143

**Summarization Reason**: The score is 0.71 because the summary includes extra information that was not present in the original text, which may lead to misinterpretation of the original message. However, it does not contradict any key points, maintaining a reasonable level of accuracy.

**Coherence Score**: 0.8867035747777032

**Coherence Reason**: The response uses clear and direct language, effectively communicating the relevance of Drucker's work to AI professionals. It avoids jargon and presents complex ideas, such as self-awareness and career development, in an accessible manner. The summary is well-structured and easy to follow, although a slight improvement could be made in reducing any potential vagueness in the advice given. Overall, the fluency and readability are strong.

**Tonality Score**: 0.9979667648735051

**Tonality Reason**: The response maintains a formal academic tone throughout, reflecting expertise in the subject matter. The language used is appropriate for an academic audience, avoiding casual expressions and ambiguity. The content is contextually relevant to AI professionals, emphasizing self-management and adaptability, which aligns well with the evaluation criteria. Additionally, the summary is clear, respectful, and free from slang, demonstrating a strong alignment with the evaluation steps.

**Safety Score**: 0.9875683488723995

**Safety Reason**: The output does not contain any real personal information or sensitive data, adhering to all evaluation steps. It focuses on the content of the book and its relevance to AI professionals without exposing any identifiable information. The use of the author's name is appropriate as it is publicly available and does not compromise privacy.

# Enhancement

Of course, evaluation is important, but we want our system to self-correct.  

+ Use the context, summary, and evaluation that you produced in the steps above to create a new prompt that enhances the summary.
+ Evaluate the new summary using the same function.
+ Report your results. Did you get a better output? Why? Do you think these controls are enough?

**Notes**:
- I experimented with the promts and kept adjusting to improve results. Clearer and more explicit intstructions contribute to better results.   

- I observed that by lowering temperature to 0.8, I got better results with improved summary generation and also improved enhancement outcomes.

- I also realized that the enhancement output is not always better than the original (sometimes inconsistent) and it was inflenced by the temperature setting for the generation task. 
- For this reason the enhancement may need to be run multiple times in a loop to try to get satisfactory metrics.
- So, I added another code cell after this one with a looped solution where the enhancement runs multiple times and tests satisfactory score has been achieved (with a loop max iteration limit).

- AI as judge was able to consistently catch hallucination, but wasn't very consitent with the SummarizationMetric. This highlights the probabilistic nature of the AI judges.
Probably using more advanced models would produce better results. I don't think that these controls are enought, I think that AI judges should be combined with the human evaluation.  

In [43]:
ENHANCEMENT_SYSTEM_PROMPT = """
	You are a senior academic researcher that reads and improves article summaries written by your colleagues using Formal Academic tone. 
	Your read article and the summary and also summary evaluation provided by an independent critic.
	Then you write a new and improved summary considering critic's evalution.
	"""

ENHANCEMENT_PROMPT = """

	Given the following context from <article>, <summary> and <summary_evaluation> do the following:

	1.	Identify the articles's title and author.

	2.	Explain relevance of this article for an AI professonal in their professional development. This should be a statement, no longer than one paragraph.	
		Do not add any extra information that is not present in the original text.

	3.	Read and consider feedback in <summary_evaluation> and then, summarize provided article concisely and succinct with no more than 1000 tokens.		
		Do not add any extra information that is not present in the original text.

		<summary_evaluation> contains score and reason for each metric as follows: 

			a) 'Summarization' has: 'SummarizationScore': score and 'SummarizationReason': reason,
			b) 'Coherence' has 'CoherenceScore': score, 'CoherenceReason': reason,
			c) 'Tonality' => 'TonalityScore': score, 'TonalityReason': reason,
			d) 'Safety' => 'SafetyScore': score, 'SafetyReason': reason,
		Score value is a float number between 0 (worst) and 1 (best). Read 'score' value and improve summary as per feedback in the 'reason' field. 		

	4. Provide your response as structured output as per provided schema. Do not fill in optional fields.

	The aricle is the following: 
	<article>
	{article}
	</article>

	The summary written by your colleague is the following:
	<summary>
	{summary}
	</summary>

	The summary evaluation provided by the critic is the following:
	<summary_evaluation>
	{evaluation}
	</summary_evaluation>
"""	

# Generate and print enhanced summary
enhanced_article_summary, enhanced_summary_response_output_text = generate_article_summary(	
																	ENHANCEMENT_SYSTEM_PROMPT, 
																	ENHANCEMENT_PROMPT.format(	
																				article=article_text,
																				summary=summary_response_output_text,
																				evaluation=evaluation_results)
																)
print_article_summary('Enhanced article summary:', enhanced_article_summary)

# Evaluate enhanced summary
evaluation_result_of_enhanced_summary = evaluate_summary(	
											model=evalution_model, 
											input=PROMPT.format(article=article_text), 
											actual_output=enhanced_summary_response_output_text
										)

# Print evaluations (enhanced and original)
print_evaluation_results('Evaluation after enhancement:', evaluation_result_of_enhanced_summary)
print_evaluation_results('\nOriginal evaluation (for reference):', evaluation_results)


Enhanced article summary:
--------------------------------------------------


**Title**: Managing Oneself

**Author**: Peter F. Drucker

**Relevance**: This article is crucial for AI professionals as it highlights the importance of self-management in evolving work contexts. As AI professionals often operate in dynamic and innovative fields, understanding personal strengths, learning styles, and values is essential for adapting and contributing effectively to their organizations, thereby advancing their careers.

**Summary**: In 'Managing Oneself,' Peter F. Drucker explores the necessity of self-management for individuals, particularly knowledge workers, in the contemporary economy. He emphasizes the importance of self-awareness, urging individuals to identify their strengths through feedback analysis and to comprehend their personal work styles and values. Drucker argues that true excellence and career satisfaction are achieved by capitalizing on personal strengths and aligning work with one's values. The article also discusses the need for adaptation to change, effective relationship management, and planning for the latter stages of one's career. Continuous self-assessment and adaptation are presented as vital for sustained professional success.

**Tone**: Formal Academic

**InputTokens**: 13275

**OutputTokens**: 212



Overall Metric Pass Rates

Summarization: 100.00% pass rate
Clarity [GEval]: 100.00% pass rate
Professionalism [GEval]: 100.00% pass rate
PII Leakage [GEval]: 100.00% pass rate




Evaluation after enhancement:
--------------------------------------------------


**Summarization Score**: 0.7272727272727273

**Summarization Reason**: The score is 0.73 because the summary introduces extra information that is not present in the original text, which may lead to misinterpretation of the original message. However, it does not contradict any key points, maintaining a reasonable level of accuracy.

**Coherence Score**: 0.8835483541199789

**Coherence Reason**: The response uses clear and direct language, effectively summarizing Drucker's key concepts without unnecessary jargon. It presents complex ideas about self-management and personal strengths in an accessible manner. The explanation is fluent and easy to read, although a minor improvement could be made by simplifying some phrases for even greater clarity.

**Tonality Score**: 0.9974042644733071

**Tonality Reason**: The response maintains a formal academic tone throughout, reflecting expertise in the subject matter. The language used is appropriate for the context, avoiding casual expressions and ambiguity. It clearly articulates the relevance of the article to AI professionals and summarizes the key points effectively, demonstrating clarity and respectfulness in its phrasing.

**Safety Score**: 0.9859363757202584

**Safety Reason**: The output does not contain any real personal information or sensitive data, adhering to all evaluation steps. It provides a summary of a published work without revealing any identifiable details about individuals. The content is appropriately anonymized and focuses on the themes of the article, ensuring user privacy is maintained.


Original evaluation (for reference):
--------------------------------------------------


**Summarization Score**: 0.7142857142857143

**Summarization Reason**: The score is 0.71 because the summary includes extra information that was not present in the original text, which may lead to misinterpretation of the original message. However, it does not contradict any key points, maintaining a reasonable level of accuracy.

**Coherence Score**: 0.8867035747777032

**Coherence Reason**: The response uses clear and direct language, effectively communicating the relevance of Drucker's work to AI professionals. It avoids jargon and presents complex ideas, such as self-awareness and career development, in an accessible manner. The summary is well-structured and easy to follow, although a slight improvement could be made in reducing any potential vagueness in the advice given. Overall, the fluency and readability are strong.

**Tonality Score**: 0.9979667648735051

**Tonality Reason**: The response maintains a formal academic tone throughout, reflecting expertise in the subject matter. The language used is appropriate for an academic audience, avoiding casual expressions and ambiguity. The content is contextually relevant to AI professionals, emphasizing self-management and adaptability, which aligns well with the evaluation criteria. Additionally, the summary is clear, respectful, and free from slang, demonstrating a strong alignment with the evaluation steps.

**Safety Score**: 0.9875683488723995

**Safety Reason**: The output does not contain any real personal information or sensitive data, adhering to all evaluation steps. It focuses on the content of the book and its relevance to AI professionals without exposing any identifiable information. The use of the author's name is appropriate as it is publicly available and does not compromise privacy.

**Notes**:
- Below I also created a looped solution where the enhancement can be run multiple times to try to achieve desired satisfactory score (with a loop max iteration limit) 
- It checks the worst score among tested metrics with min(score1, score2, ...) function



In [None]:
# -----------------------------------------------------------------------------------------------------
# Below we run enhancement multiple times to try to achieve desired score (with max iteration limit) 
# (uses functions defined above)
# -----------------------------------------------------------------------------------------------------


# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
def get_worst_score(evaluation_results: dict) -> float:
	return min(evaluation_results["SummarizationScore"], evaluation_results["CoherenceScore"], evaluation_results["TonalityScore"], evaluation_results["SafetyScore"])


SATISFACTORY_SCORE = 0.8
MAX_ENHANCEMENTS_TRIES_LIMIT = 3

# First run (generation and evaluation):
article_summary, summary_response_output_text = generate_article_summary(SYSTEM_PROMPT, PROMPT.format(article=article_text))
print_article_summary('Original article summary:', article_summary)
evaluation_results = evaluate_summary(model=evalution_model, input=PROMPT.format(article=article_text), actual_output=summary_response_output_text)
evaluation_worst_score = get_worst_score(evaluation_results)
print_evaluation_results('Original article summary evaluation:', evaluation_results)

# Enhancements iteration (if needed):
if evaluation_worst_score >= SATISFACTORY_SCORE:
	print(f'\nNo need to run enhancemnet since satisfactory scores ( >= {SATISFACTORY_SCORE}) were already achieved. \n----------------------------------------------------------------------------')
else:	
	print('\n\nTrying to enhance/improve results:\n--------------------------------------------------')
	for i in range(MAX_ENHANCEMENTS_TRIES_LIMIT):
		print(f'Enhancement iteration: {i+1}\n')
		article_summary, summary_response_output_text = generate_article_summary(
										ENHANCEMENT_SYSTEM_PROMPT, 
										ENHANCEMENT_PROMPT.format(
												article=article_text,
												summary=summary_response_output_text,
												evaluation=evaluation_results)
										)
		evaluation_results = evaluate_summary(
										model=evalution_model, 
										input=PROMPT.format(article=article_text),
										actual_output=summary_response_output_text
							)
		evaluation_worst_score = get_worst_score(evaluation_results)
		if evaluation_worst_score >= SATISFACTORY_SCORE:
			break
		else:
			print_evaluation_results('\nIteration article summary evalutation:', evaluation_results)

	if evaluation_worst_score < SATISFACTORY_SCORE:
		print(f'\nCould not achieve satisfactory scores of {SATISFACTORY_SCORE} with {MAX_ENHANCEMENTS_TRIES_LIMIT} iterations.')

	print_article_summary('\nFinal article summary:', article_summary)
	print_evaluation_results('\nFinal article summary evalutation:', evaluation_results)

Original article summary:
--------------------------------------------------


**Title**: Managing Oneself

**Author**: Peter F. Drucker

**Relevance**: This article is highly relevant for AI professionals as it emphasizes the importance of self-management in a knowledge-driven economy. By understanding personal strengths, work styles, and values, AI professionals can effectively navigate their careers, make significant contributions, and achieve excellence. This self-awareness is crucial for adapting to the rapidly changing technological landscape and maintaining long-term career growth in the AI field.

**Summary**: In "Managing Oneself," Peter F. Drucker discusses the necessity for individuals to be their own chief executive officers in the modern knowledge economy. He argues that success depends on a deep understanding of one's strengths, weaknesses, values, and work style. Drucker introduces feedback analysis as a tool to identify strengths and areas for improvement. He emphasizes the importance of aligning personal values with organizational values to achieve satisfaction and performance. The article also explores how knowing one's strengths and preferred work environment can lead to more effective career decisions. Additionally, Drucker highlights the need to manage personal relationships and communication in the workplace, as well as the importance of planning for a second career or parallel career to maintain engagement and productivity throughout a lifetime. Overall, self-management is presented as a continuous process crucial for personal and professional success.

**Tone**: Formal Academic

**InputTokens**: 12436

**OutputTokens**: 270



Overall Metric Pass Rates

Summarization: 100.00% pass rate
Clarity [GEval]: 100.00% pass rate
Professionalism [GEval]: 100.00% pass rate
PII Leakage [GEval]: 100.00% pass rate




Original article summary evaluation:
--------------------------------------------------


**Summarization Score**: 1.0

**Summarization Reason**: The score is 1.00 because the summary accurately reflects the original text without any contradictions or extra information, demonstrating a perfect alignment with the source material.

**Coherence Score**: 0.8924141814233308

**Coherence Reason**: The response uses clear and direct language, effectively communicating the relevance of Drucker's work to AI professionals. It avoids jargon and presents complex ideas, such as self-management and feedback analysis, in an accessible manner. The summary is well-structured and easy to follow, with no vague or confusing parts. Overall, the fluency and readability are strong, making it a highly effective response.

**Tonality Score**: 0.9946596667375396

**Tonality Reason**: The response maintains a formal academic tone throughout, reflecting expertise in the subject matter. The language is appropriate for the context, avoiding casual expressions and ensuring clarity. The summary effectively conveys the key concepts of self-management as discussed by Drucker, aligning well with the evaluation steps.

**Safety Score**: 0.993133263140737

**Safety Reason**: The output does not include any real personal information or sensitive data, adhering to all evaluation steps. It provides a summary of a published work without revealing any identifiable details about individuals. The content is appropriately anonymized and focuses on the themes of self-management relevant to AI professionals, ensuring user privacy is maintained.


No need to run enhancemnet since satisfactory scores ( >= 0.8) were already achieved. 
----------------------------------------------------------------------------


Please, do not forget to add your comments.


# Submission Information

üö® **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** üö® for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

## Submission Parameters

- The Submission Due Date is indicated in the [readme](../README.md#schedule) file.
- The branch name for your repo should be: assignment-1
- What to submit for this assignment:
    + This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
- What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    + Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

## Checklist

+ Created a branch with the correct naming convention.
+ Ensured that the repository is public.
+ Reviewed the PR description guidelines and adhered to them.
+ Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
